Language Models are General-Purpose Interfaces
- URL: http://arxiv.org/abs/2206.06336v1
- Date: Mon, 13 Jun 2022 17:34:22 GMT
- Title: Language Models are General-Purpose Interfaces
- Authors: Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang,
Shuming Ma, Furu Wei
- Abstract summary: We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceive diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
- Score: 109.45478241369655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models have received much attention due to their effectiveness
across a broad range of downstream applications. Although architectures have
largely converged, most pretrained models are still developed for specific
tasks or modalities. In this work, we propose to
use language models as a general-purpose interface to various foundation
models. A collection of pretrained encoders perceive diverse modalities (such
as vision and language), and they dock with a language model that plays the
role of a universal task layer. We propose a semi-causal language modeling
objective to jointly pretrain the interface and the modular encoders. We
subsume the advantages and capabilities of both causal and non-causal
modeling, thereby combining the best of both worlds. Specifically, the proposed
method not only inherits the capabilities of in-context learning and open-ended
generation from causal language modeling, but is also conducive to finetuning
because of the bidirectional encoders. More importantly, our approach
seamlessly unlocks the combinations of the above capabilities, e.g., enabling
in-context learning or instruction following with finetuned encoders.
Experimental results across various language-only and vision-language
benchmarks show that our model outperforms or is competitive with specialized
models on finetuning, zero-shot generalization, and few-shot learning.
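The semi-causal objective described above can be illustrated with a small attention-mask sketch. The following NumPy example is a hypothetical minimal illustration, not the authors' implementation: tokens produced by a modality encoder form a bidirectional (non-causal) prefix, and the language-model tokens that follow attend causally.

```python
import numpy as np

def semi_causal_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Build a semi-causal attention mask.

    Positions [0, prefix_len) form a bidirectional span, as produced by a
    modality encoder; positions [prefix_len, total_len) are decoded causally
    by the language model acting as the universal task layer.
    mask[i, j] == 1 means position i may attend to position j.
    """
    mask = np.tril(np.ones((total_len, total_len), dtype=np.int8))  # causal base
    mask[:prefix_len, :prefix_len] = 1  # full attention within the encoder span
    return mask

# Example: a 3-token encoded image prefix followed by 3 causally modeled tokens.
print(semi_causal_mask(3, 6))
```

Under this mask, finetuning benefits from the bidirectional encoder span, while open-ended generation and in-context learning proceed causally over the suffix.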
Related papers
- IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities [4.269326314400742]
We introduce the Inner-Adaptor Architecture for multimodal large language models (MLLMs).
The architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers.
Unlike previous frozen-language-model approaches that require large-scale aligned data, the proposed architecture achieves superior performance on small-scale datasets.
arXiv Detail & Related papers (2024-08-23T08:10:13Z)
- Collaborative decoding of critical tokens for boosting factuality of large language models [57.504894664689]
Finetuned and aligned models show improved instruction following and safer generation.
The common practice of using sampling during generation also increases the chance of hallucination.
We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z)
- Language Models are Universal Embedders [48.12992614723464]
We show that pre-trained transformer decoders can embed universally when finetuned on limited English data.
Our models achieve competitive performance on different embedding tasks with minimal training data.
These results provide evidence of a promising path towards building powerful unified embedders.
arXiv Detail & Related papers (2023-10-12T11:25:46Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Grafting Pre-trained Models for Multimodal Headline Generation [12.063053852096514]
Multimodal headline generation uses both video frames and transcripts to produce natural language titles for videos.
Previous research on pre-trained language models and video-language models has achieved significant progress on related downstream tasks.
We propose a novel approach that grafts the video encoder from a pre-trained video-language model onto a generative pre-trained language model.
arXiv Detail & Related papers (2022-11-14T08:59:59Z)
- DIRECTOR: Generator-Classifiers For Supervised Language Modeling [27.86870968048833]
Current language models achieve low perplexity, but their generations still suffer from toxic responses, repetitiveness, and contradictions.
We introduce a new architecture, Director, that consists of a unified generator-classifier with both a language modeling head and a classification head for each output token.
arXiv Detail & Related papers (2022-06-15T17:44:08Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.