VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
- URL: http://arxiv.org/abs/2501.01957v3
- Date: Tue, 21 Jan 2025 15:36:41 GMT
- Title: VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
- Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
- Abstract summary: We propose a multi-stage training methodology that progressively trains the LLM to understand both visual and speech information. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities.
- Score: 105.88658935310605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.
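To make the end-to-end design concrete, below is a minimal sketch of how a vision encoder and a speech encoder might be adapted into a shared LLM embedding space, with the LLM predicting discrete speech-codec tokens directly so that no separate ASR or TTS module sits in the response path. All module names, dimensions, and the codebook size are illustrative assumptions, not the actual VITA-1.5 implementation.

```python
import torch
import torch.nn as nn


class MultimodalSpeechLLM(nn.Module):
    """Toy end-to-end pipeline: vision + speech features in, speech-codec tokens out."""

    def __init__(self, d_model=1024, vision_dim=768, speech_dim=512, n_speech_codes=4096):
        super().__init__()
        # Lightweight adapters project encoder outputs into the LLM embedding space.
        self.vision_proj = nn.Linear(vision_dim, d_model)
        self.speech_proj = nn.Linear(speech_dim, d_model)
        # Stand-in for a decoder-only LLM backbone (two layers keep the sketch cheap).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # A token head over a discrete speech codebook: the LLM emits speech tokens
        # directly, so no separate ASR or TTS stage is needed at inference time.
        self.speech_head = nn.Linear(d_model, n_speech_codes)

    def forward(self, vision_feats, speech_feats):
        tokens = torch.cat(
            [self.vision_proj(vision_feats), self.speech_proj(speech_feats)], dim=1
        )
        hidden = self.backbone(tokens)
        return self.speech_head(hidden)  # logits over speech-codec tokens


if __name__ == "__main__":
    model = MultimodalSpeechLLM()
    vision_feats = torch.randn(1, 16, 768)  # e.g. patch features from a vision encoder
    speech_feats = torch.randn(1, 50, 512)  # e.g. frames from a speech encoder
    logits = model(vision_feats, speech_feats)
    print(logits.shape)  # torch.Size([1, 66, 4096])
```

In the paper's multi-stage recipe, the vision path is aligned before the speech path is added; the sketch only shows the final wiring, not the training schedule.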
Related papers
- SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering [0.0]
We introduce SViQA, a unified speech-vision model that processes spoken questions without text transcription.
Building upon the LLaVA architecture, our framework bridges auditory and visual modalities through two key innovations.
Extensive experimental results on the SBVQA benchmark demonstrate the proposed SViQA's state-of-the-art performance.
arXiv Detail & Related papers (2025-04-01T07:15:32Z)
- Vision-Speech Models: Teaching Speech Models to Converse about Images [67.62394024470528]
We introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules.
An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics.
We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
arXiv Detail & Related papers (2025-03-19T18:40:45Z)
- Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning [32.95008932216176]
We introduce MMDiag, a multi-turn multimodal dialogue dataset.
We also present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities.
arXiv Detail & Related papers (2025-03-10T07:32:53Z)
- Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models [20.210120763433167]
We propose a Self-Knowledge Distillation (Self-KD) training method where the vision-text component of the OLLM serves as the teacher and the vision-audio component as the student (a minimal loss sketch appears after this list).
Our experimental results demonstrate that Self-KD is an effective method for enhancing the vision-audio capabilities of OLLMs.
arXiv Detail & Related papers (2025-02-27T02:19:09Z)
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [152.41217651729738]
GPT-4o is an omni-modal model that enables vocal conversations with diverse emotions and tones.
We propose EMOVA to enable Large Language Models with end-to-end speech capabilities.
For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks.
arXiv Detail & Related papers (2024-09-26T16:44:02Z)
- VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism.
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) as encoder represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
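As a companion to the Self-Knowledge Distillation entry above ("Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models"), here is a minimal sketch of the distillation loss it describes: the vision-text branch of an omni-modal LLM acts as the teacher and the vision-audio branch as the student. The function name, temperature, and tensor shapes are illustrative assumptions, not that paper's implementation.

```python
import torch
import torch.nn.functional as F


def self_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the vision-text (teacher) to the vision-audio (student) outputs."""
    t = temperature
    vocab = teacher_logits.size(-1)
    # Flatten (batch, seq, vocab) to (batch * seq, vocab) so 'batchmean' averages per token.
    teacher_probs = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)


# Toy usage: logits for the same target text, produced from vision+text input (teacher,
# detached so only the student branch receives gradients) and vision+audio input (student).
teacher_logits = torch.randn(4, 32, 32000)                      # (batch, seq_len, vocab)
student_logits = torch.randn(4, 32, 32000, requires_grad=True)
loss = self_kd_loss(teacher_logits.detach(), student_logits)
loss.backward()
print(float(loss))
```

Scaling the loss by the squared temperature is a common knowledge-distillation convention that keeps gradient magnitudes comparable across temperature settings.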
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.