AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- URL: http://arxiv.org/abs/2402.12226v3
- Date: Thu, 7 Mar 2024 06:31:46 GMT
- Title: AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- Authors: Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng
Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui,
Tianxiang Sun, Yugang Jiang, Xipeng Qiu
- Abstract summary: We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
- Score: 115.89786751297348
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce AnyGPT, an any-to-any multimodal language model that utilizes
discrete representations for the unified processing of various modalities,
including speech, text, images, and music. AnyGPT can be trained stably without
any alterations to the current large language model (LLM) architecture or
training paradigms. Instead, it relies exclusively on data-level preprocessing,
facilitating the seamless integration of new modalities into LLMs, akin to the
incorporation of new languages. We build a multimodal text-centric dataset for
multimodal alignment pre-training. Utilizing generative models, we synthesize
the first large-scale any-to-any multimodal instruction dataset. It consists of
108k samples of multi-turn conversations that intricately interweave various
modalities, thus equipping the model to handle arbitrary combinations of
multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is
capable of facilitating any-to-any multimodal conversation while achieving
performance comparable to specialized models across all modalities, proving
that discrete representations can effectively and conveniently unify multiple
modalities within a language model. Demos are shown at
https://junzhan2000.github.io/AnyGPT.github.io/
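To illustrate the data-level preprocessing the abstract describes, here is a minimal Python sketch of how modality-specific discrete codes might be folded into one token sequence for an ordinary LLM. The tokenizers, vocabulary sizes, and tag tokens below are hypothetical stand-ins, not AnyGPT's actual components.

```python
# Sketch: each modality is discretized by its own tokenizer, wrapped in
# modality tags, and concatenated into a single token sequence. All
# vocabularies, codes, and tag IDs here are invented for illustration.
from typing import Dict, List

TEXT_VOCAB_SIZE = 32000
# Hypothetical per-modality codebooks appended after the text vocabulary
# (in the paper, tokenizers such as a speech tokenizer or an image VQ
# model would supply the real discrete codes).
CODEBOOK_SIZES = {"image": 8192, "speech": 1024, "music": 4096}

def codebook_offset(modality: str) -> int:
    """Offset a modality's discrete codes into the shared LLM vocabulary."""
    offset = TEXT_VOCAB_SIZE
    for name, size in CODEBOOK_SIZES.items():
        if name == modality:
            return offset
        offset += size
    raise KeyError(modality)

def wrap_modality(codes: List[int], modality: str,
                  tags: Dict[str, int]) -> List[int]:
    """Surround shifted modality codes with begin/end tag tokens."""
    shifted = [c + codebook_offset(modality) for c in codes]
    return [tags[f"<{modality}>"]] + shifted + [tags[f"</{modality}>"]]

# Hypothetical special-token IDs for the modality tags.
TAGS = {"<image>": 60000, "</image>": 60001,
        "<speech>": 60002, "</speech>": 60003}

# "Describe this image: <image codes> It shows a cat." as one flat sequence:
text_prefix = [17, 942, 3051]   # pretend text-tokenizer output
image_codes = [5, 77, 300, 12]  # pretend VQ codes for the image
text_suffix = [88, 1203, 6]

sequence = text_prefix + wrap_modality(image_codes, "image", TAGS) + text_suffix
print(sequence)  # a flat token list; the LLM trains on it unchanged
```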
Related papers
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
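As a rough illustration of what evaluation on an OmniBench-style tri-modal item could look like, here is a hedged sketch; the sample schema and model interface are assumptions, not the benchmark's actual API.

```python
# Sketch: a tri-modal multiple-choice item and a simple accuracy metric.
# The field names and the model signature are assumed for illustration.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TriModalItem:
    image_path: str
    audio_path: str
    question: str
    options: List[str]
    answer: str  # gold option letter, e.g. "B"

def accuracy(model: Callable[[TriModalItem], str],
             items: List[TriModalItem]) -> float:
    """Fraction of items where the model picks the gold option."""
    correct = sum(model(item) == item.answer for item in items)
    return correct / len(items) if items else 0.0

# Trivial baseline: always answer "A" (a floor for checking whether an
# omni-language model actually uses all three modalities).
items = [TriModalItem("img.png", "clip.wav", "What is making the sound?",
                      ["A. a dog", "B. a violin"], "B")]
print(accuracy(lambda item: "A", items))  # 0.0
```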
- Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing [17.92378239787507]
We present a decoder-only Discrete Multimodal Language Model (DMLM).
DMLM can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision).
Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training.
arXiv Detail & Related papers (2024-06-04T20:08:25Z)
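The mixed-supervision setup above can be pictured as one decoder-only LM consuming flattened token sequences built from both paired and unpaired data; the sketch below uses hypothetical task tags and token IDs, not DMLM's actual format.

```python
# Sketch: one decoder-only LM over discrete tokens, where paired data
# (e.g. ASR) and unpaired data (text- or speech-only) are both flattened
# into sequences distinguished by task tags. IDs are invented.
TASK_TAGS = {"[ASR]": 0, "[T2S]": 1, "[TEXT_LM]": 2, "[SPEECH_LM]": 3}
SEP = 4  # hypothetical separator between source and target spans

def supervised_example(task, src_tokens, tgt_tokens):
    """Paired data: the loss would typically cover the target span."""
    return [TASK_TAGS[task]] + src_tokens + [SEP] + tgt_tokens

def unsupervised_example(task, tokens):
    """Unpaired data: plain next-token prediction over one modality."""
    return [TASK_TAGS[task]] + tokens

batch = [
    supervised_example("[ASR]", [101, 102, 103], [7, 8, 9]),  # speech -> text
    unsupervised_example("[SPEECH_LM]", [101, 104, 105]),     # speech only
    unsupervised_example("[TEXT_LM]", [7, 12, 13]),           # text only
]
for seq in batch:
    print(seq)
```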
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an unbiased multiscale modal fusion model for multimodal semantic segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
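One way to picture multiscale fusion of global and local features is the toy NumPy sketch below, which brings per-modality feature maps to a common resolution and averages them without modality-specific weights (one reading of "unbiased"); the shapes and fusion rule are illustrative assumptions, not U3M itself.

```python
# Sketch: per-modality feature maps at several resolutions are upsampled
# to a common size and averaged without learned modality-specific weights.
import numpy as np

def upsample(feat: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a (C, H, W) map to (C, size, size)."""
    factor = size // feat.shape[-1]
    return feat.repeat(factor, axis=-2).repeat(factor, axis=-1)

def fuse(modality_feats: dict, out_size: int = 32) -> np.ndarray:
    """Average all modalities and scales at a common resolution."""
    upsampled = [upsample(f, out_size)
                 for feats in modality_feats.values() for f in feats]
    return np.mean(upsampled, axis=0)

rng = np.random.default_rng(0)
feats = {  # e.g. RGB and depth branches, each with 32x32, 16x16, 8x8 maps
    "rgb":   [rng.normal(size=(64, s, s)) for s in (32, 16, 8)],
    "depth": [rng.normal(size=(64, s, s)) for s in (32, 16, 8)],
}
print(fuse(feats).shape)  # (64, 32, 32)
```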
- NExT-GPT: Any-to-Any Multimodal LLM [75.5656492989924]
We present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT.
We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio.
We introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation.
arXiv Detail & Related papers (2023-09-11T15:02:25Z)
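The adaptor/decoder wiring described above might be schematized as follows; every component here is a placeholder stub, not NExT-GPT's actual modules, and the signal-token format is invented for illustration.

```python
# Sketch: modality encoders project inputs into the LLM's input space, and
# special signal tokens in the LLM output are routed to diffusion decoders.
def image_adaptor(image):                # stub: image -> LLM-space embeddings
    return [f"emb({image})"]

def llm(prompt_embeddings):              # stub LLM: returns mixed output tokens
    return ["The", "answer:", "<IMG_SIGNAL:sunset over mountains>"]

def image_diffusion_decoder(condition):  # stub diffusion decoder
    return f"generated image conditioned on '{condition}'"

def any_to_any(image, text):
    tokens = llm(image_adaptor(image) + list(text.split()))
    outputs = []
    for tok in tokens:
        if tok.startswith("<IMG_SIGNAL:"):  # route signal tokens to decoders
            outputs.append(image_diffusion_decoder(tok[12:-1]))
        else:
            outputs.append(tok)             # plain text passes through
    return outputs

print(any_to_any("photo.png", "draw a similar scene"))
```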
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
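The three-module layout could be sketched roughly as below, with the alignment step shown as per-modality projections into a common LLM width; the dimensions and the random projection matrices standing in for trained weights are illustrative assumptions.

```python
# Sketch: per-modality encoders (modality module) emit features of different
# widths; an alignment module projects them all to the LLM embedding width;
# the "cognitive" LLM would consume the concatenated result.
import numpy as np

rng = np.random.default_rng(0)
LLM_DIM = 512

# Modality module: pretend encoder outputs of shape (n_tokens, feature_dim).
features = {"visual": rng.normal(size=(4, 768)),
            "audio":  rng.normal(size=(6, 128)),
            "text":   rng.normal(size=(5, 512))}

# Alignment module: one projection per modality into LLM_DIM
# (random matrices stand in for trained weights).
projections = {m: rng.normal(size=(f.shape[1], LLM_DIM)) * 0.02
               for m, f in features.items()}

aligned = [features[m] @ projections[m] for m in features]
llm_input = np.concatenate(aligned, axis=0)  # fed to the cognitive module
print(llm_input.shape)  # (15, 512)
```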
- MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation [21.455774034659978]
MultiFusion allows one to express complex concepts with arbitrarily interleaved inputs of multiple modalities and languages.
MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system.
arXiv Detail & Related papers (2023-05-24T16:22:18Z)
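A toy sketch of arbitrarily interleaved multilingual and multimodal prompting follows; the encoders and the conditioning format are placeholders, not MultiFusion's actual pipeline.

```python
# Sketch: text spans (in any language) and image references are flattened,
# in any order, into one conditioning sequence for a generation backend.
def encode_text(span):   # stub multilingual text encoder
    return f"txt_emb({span!r})"

def encode_image(path):  # stub image encoder
    return f"img_emb({path!r})"

def build_condition(prompt_parts):
    """prompt_parts: list of ('text', str) or ('image', path) in any order."""
    encoders = {"text": encode_text, "image": encode_image}
    return [encoders[kind](value) for kind, value in prompt_parts]

# German text interleaved with a reference image, then English text:
condition = build_condition([
    ("text", "eine Katze im Stil von"),  # "a cat in the style of"
    ("image", "style_ref.png"),
    ("text", "at sunset"),
])
print(condition)  # passed as conditioning to the image generator
```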
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
Our method is also extremely lightweight and generalizes easily to other tasks and unseen data, with only a small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
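The 2-D arrangement described above can be sketched as follows: word-aligned features from each modality are stacked into an image-like matrix that a convolutional autoencoder could then compress. The feature sizes are illustrative assumptions, and the autoencoder itself is omitted.

```python
# Sketch: for each word position, the word-aligned features from every
# modality are interleaved row-wise, giving a 2-D matrix a Conv2D
# autoencoder can treat as a one-channel image.
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 20, 32  # 20 word-aligned steps, per-modality features of width 32

# Word-aligned streams (text, audio, visual), each of shape (seq_len, dim).
streams = [rng.normal(size=(seq_len, dim)) for _ in range(3)]

# Interleave modalities along the row axis -> (seq_len * 3, dim) "image".
matrix = np.stack(streams, axis=1).reshape(seq_len * 3, dim)
print(matrix.shape)  # (60, 32): input to a convolutional autoencoder
```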
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training [119.16007395162431]
M3P is a Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training.
We show that M3P can achieve comparable results for English and new state-of-the-art results for non-English languages.
arXiv Detail & Related papers (2020-06-04T03:54:29Z)
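Combining multilingual and multimodal pre-training can be pictured as sampling training steps from two data streams, as in this toy sketch; the sampling scheme and the data are assumptions for illustration, not M3P's actual recipe.

```python
# Sketch: training alternates between multilingual text-only batches and
# (typically English) image-caption batches, so the model can link
# non-English text to images transitively through shared representations.
import random

random.seed(0)
multilingual_text = ["un chat noir", "ein schwarzer Hund", "黒い猫"]
image_caption_pairs = [("img1.png", "a black cat"), ("img2.png", "a black dog")]

def next_batch(multimodal_ratio=0.5):
    """Pick a pre-training stream for this step."""
    if random.random() < multimodal_ratio:
        return ("multimodal", random.choice(image_caption_pairs))
    return ("multilingual", random.choice(multilingual_text))

for _ in range(4):
    print(next_batch())
```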
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.