Jointly Training Large Autoregressive Multimodal Models
- URL: http://arxiv.org/abs/2309.15564v2
- Date: Thu, 28 Sep 2023 17:23:11 GMT
- Title: Jointly Training Large Autoregressive Multimodal Models
- Authors: Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz
- Abstract summary: We present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models.
We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks.
Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs.
- Score: 37.32912103934043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, advances in the large-scale pretraining of language and
text-to-image models have revolutionized the field of machine learning. Yet,
integrating these two modalities into a single, robust model capable of
generating seamless multimodal outputs remains a significant challenge. To
address this gap, we present the Joint Autoregressive Mixture (JAM) framework,
a modular approach that systematically fuses existing text and image generation
models. We also introduce a specialized, data-efficient instruction-tuning
strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned
model demonstrates unparalleled performance in generating high-quality
multimodal outputs and represents the first model explicitly designed for this
purpose.
Related papers
- ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation [27.773146599559286]
Anole is an open, autoregressive, native large multimodal model for interleaved image-text generation.
We have open-sourced our model, training framework, and instruction tuning data.
arXiv Detail & Related papers (2024-07-08T17:08:02Z) - DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple and effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z) - EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) shows outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z) - MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models [34.611309081801345]
Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation.
In this paper, we propose a novel strategy to scale a generative model across new tasks with minimal compute.
arXiv Detail & Related papers (2024-04-15T17:55:56Z) - Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z) - Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks including question generation, summarization and paraphrase generation, show that the proposed framework achieves the new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z) - OCHADAI-KYODAI at SemEval-2021 Task 1: Enhancing Model Generalization
and Robustness for Lexical Complexity Prediction [8.066349353140819]
We propose an ensemble model for predicting the lexical complexity of words and multiword expressions.
The model receives as input a sentence with a target word or MWEand outputs its complexity score.
Our model achieved competitive results and ranked among the top-10 systems in both sub-tasks.
arXiv Detail & Related papers (2021-05-12T09:27:46Z) - InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6.
The model owns strong capability of modeling interaction between the information flows of different modalities.
We propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT which is the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.