Related papers: Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models

Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models

URL: http://arxiv.org/abs/2503.00059v1
Date: Thu, 27 Feb 2025 02:19:09 GMT
Title: Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models
Authors: Rui Hu, Delai Qiu, Shuyu Wei, Jiaming Zhang, Yining Wang, Shengping Liu, Jitao Sang,
Abstract summary: We propose a Self-Knowledge Distillation (Self-KD) training method where the vision-text component of the OLLM serves as the teacher and the vision-audio component as the student.<n>Our experimental results demonstrate that Self-KD is an effective method for enhancing the vision-audio capabilities of OLLMs.
Score: 20.210120763433167
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Omnimodal Large Language Models (OLLMs) have shown significant progress in integrating vision and text, but still struggle with integrating vision and audio, often exhibiting suboptimal performance when processing audio queries compared to text queries. This disparity is primarily due to insufficient alignment between vision and audio modalities during training, leading to inadequate attention to visual information when using audio queries. To mitigate this issue, we propose a Self-Knowledge Distillation (Self-KD) training method where the vision-text component of the OLLM serves as the teacher and the vision-audio component as the student. This enables the model to process audio in a manner analogous to its text processing. Our experimental results demonstrate that Self-KD is an effective method for enhancing the vision-audio capabilities of OLLMs by learning from the vision-text components, which subsequently improves the interaction between audio and images and results in improved performance on multimodal tasks.

Related papers

From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.<n>These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.<n>We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement [36.136070412464214]
Speech Enhancement (SE) aims to improve the quality of noisy speech.<n>We propose a novel multi-modality learning framework for SE.<n>We show that our proposed AVSE system significantly enhances speech quality and reduces generative artifacts.
arXiv Detail & Related papers (2025-01-23T04:36:29Z)
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction [105.88658935310605]
We propose a multi-stage training methodology that progressively trains LLM to understand both visual and speech information. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities.
arXiv Detail & Related papers (2025-01-03T18:59:52Z)
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training [6.34265125858783]
We propose a noise-robust framework for efficient vision-language pre-training that requires less pre-training data. Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer. We introduce two innovative learning strategies: noise-adaptive learning and concept-enhanced learning.
arXiv Detail & Related papers (2024-09-15T01:54:17Z)
Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection. The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task. The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z)
Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform. We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities. Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model) The proposed VATLM employs a unified backbone network to model the modality-independent information. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector. To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scene. We show that our localization model significantly outperforms existing methods, based on which we show comparable performance in sound separation without referring external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z)
Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields. We discuss state-of-the-art methods as well as the remaining challenges of each subfield. We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.