Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation
- URL: http://arxiv.org/abs/2506.12481v1
- Date: Sat, 14 Jun 2025 12:44:58 GMT
- Title: Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation
- Authors: Runhao Zeng, Qi Deng, Ronghao Zhang, Shuaicheng Niu, Jian Chen, Xiping Hu, Victor C. M. Leung,
- Abstract summary: Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase.<n>We propose a novel approach that incorporates audio information into video TTA.<n>Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels.
- Score: 46.29811604867483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.
Related papers
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z) - AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations.
This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models.
arXiv Detail & Related papers (2024-05-17T21:08:58Z) - Text-to-Audio Generation Synchronized with Videos [44.848393652233796]
We introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench.
We also present a simple yet effective video-aligned TTA generation model, namely T2AV.
It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet.
arXiv Detail & Related papers (2024-03-08T22:27:38Z) - STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontemporal problem and poses two critical challenges: sparse-temporal correlation between audio-video pairs and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs.
We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - AudioFormer: Audio Transformer learns audio feature representations from
discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - Learnable Irrelevant Modality Dropout for Multimodal Action Recognition
on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.