Related papers: Audio-visual training for improved grounding in video-text LLMs

URL: http://arxiv.org/abs/2407.15046v1
Date: Sun, 21 Jul 2024 03:59:14 GMT
Title: Audio-visual training for improved grounding in video-text LLMs
Authors: Shivprasad Sagare, Hemachandran S, Kinshuk Sarabhai, Prashant Ullegaddi, Rajeshkumar SA
Abstract summary: We propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset.
Score: 1.9320359360360702
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most previous works support visual input only, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data. Hence, the effect of audio on video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparisons with vision-only baselines and other audio-visual models show that training on audio data indeed leads to improved grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset with audio-aware question-answer pairs.

Related papers

Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation [20.446421146630474]
We introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony.
arXiv Detail & Related papers (2025-12-02T06:31:38Z)
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Aligned Better, Listen Better for Audio-Visual Large Language Models [21.525317311280205]
Video inherently contains audio, which supplies information to vision. Video large language models (Video-LLMs) can encounter many audio-centric settings. Existing models exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations.
arXiv Detail & Related papers (2025-04-02T18:47:09Z)
Audio-centric Video Understanding Benchmark without Text Shortcut [49.01648001666229]
Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs). This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs.
arXiv Detail & Related papers (2025-03-25T16:28:24Z)
Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup [2.80888070977859]
We propose audio-visual SSL for video action recognition, which uses the visual and audio modalities together. In experiments on the UCF-51, Kinetics-400, and VGGSound datasets, the proposed framework shows superior performance.
arXiv Detail & Related papers (2025-03-04T05:13:56Z)
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation. VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively. Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models [27.54879344983513]
Video-SALMONN can understand not only visual frame sequences, audio events and music, but speech as well. Video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented by other av-LLMs.
arXiv Detail & Related papers (2024-06-22T01:36:11Z)
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup. We introduce a unified audio-visual few-shot video classification benchmark on three datasets. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries [18.224608377111533]
Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task. We propose a framework for recommending HQ SFX given a video frame. We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data.
arXiv Detail & Related papers (2023-08-17T16:38:30Z)
Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz. We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks. Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.