CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment
- URL: http://arxiv.org/abs/2602.08309v1
- Date: Mon, 09 Feb 2026 06:30:25 GMT
- Title: CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment
- Authors: Yunzuo Hu, Wen Li, Jing Zhang
- Abstract summary: We propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning. It uses two complementary modules: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE). CASTE balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment.
- Score: 12.793962173450494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design three lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
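The abstract names frame-level audio-visual agreement, a caption-to-modality InfoNCE objective, and entropy regularization, but this page does not give their exact definitions. The following is a minimal sketch under stated assumptions, not the authors' implementation: agreement is taken to be cosine similarity between per-frame audio and visual embeddings, InfoNCE is the standard contrastive form with a temperature, and the entropy term is the Shannon entropy of a softmax over token-selection scores. All function names and the temperature value are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def frame_agreement(audio_frames, visual_frames):
    """Per-frame agreement score; cosine similarity is an assumption here."""
    return [cosine(a, v) for a, v in zip(audio_frames, visual_frames)]


def info_nce(anchors, targets, temperature=0.07):
    """Standard InfoNCE: anchors[i] should match targets[i] over all targets."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def selection_entropy(scores):
    """Shannon entropy of the softmax over (hypothetical) selection scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy usage: caption embeddings as anchors, audio embeddings as targets.
captions = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.9, 0.1], [0.2, 0.8]]
loss = info_nce(captions, audio)
agreement = frame_agreement(audio, captions)
entropy = selection_entropy([2.0, 0.5, 0.1])
```

In this sketch a low InfoNCE value means each caption embedding is closest to its own audio embedding, while the entropy term measures how concentrated the selection scores are; how these terms are weighted and combined in CAE-AV is not stated on this page.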
Related papers
- Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing [26.317163478761916]
Weakly-supervised audio-visual video parsing seeks to detect audible, visible, and audio-visual events without temporal annotations.
We propose an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks.
We also propose a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs.
arXiv Detail & Related papers (2025-09-17T15:38:05Z)
- AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition [2.4842074869626396]
We introduce a new AVSR framework termed AD-AVSR based on bidirectional modality enhancement.
Specifically, we first introduce the audio dual-stream encoding strategy to enrich audio representations from multiple perspectives.
We adopt a threshold-based selection mechanism to filter out irrelevant or weakly correlated audio-visual pairs.
arXiv Detail & Related papers (2025-08-11T04:23:08Z)
- AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation.
AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z)
- Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding.
Due to the lack of semantics, heterogeneous representations may lead to erroneous matches.
We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z)
- Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning [37.17910848101769]
Current vision-guided audio captioning systems fail to address audiovisual misalignment in real-world scenarios.
We present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification.
We also develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs.
arXiv Detail & Related papers (2025-05-28T07:08:17Z)
- CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment [76.32508013503653]
We propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning.
We tackle the mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations.
We improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens.
arXiv Detail & Related papers (2025-05-02T12:59:58Z)
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.
Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- CAD -- Contextual Multi-modal Alignment for Dynamic AVQA [20.155816093525374]
Existing AVQA methods suffer from two major shortcomings.
The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4%.
arXiv Detail & Related papers (2023-10-25T16:40:09Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition [21.477900473255264]
We propose a cross-modal global interaction and local alignment (GILA) approach for audio-visual speech recognition (AVSR).
Specifically, we design a global interaction model to capture the A-V complementary relationship on modality level, as well as a local alignment approach to model the A-V temporal consistency on frame level.
Our GILA outperforms the supervised learning state-of-the-art on public benchmarks LRS3 and LRS2.
arXiv Detail & Related papers (2023-05-16T06:41:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.