Related papers: CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

URL: http://arxiv.org/abs/2602.08309v1
Date: Mon, 09 Feb 2026 06:30:25 GMT
Title: CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment
Authors: Yunzuo Hu, Wen Li, Jing Zhang
Abstract summary: We propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning. It uses two complementary modules: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE). CASTE balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment.
Score: 12.793962173450494
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
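
The abstract names caption-to-modality InfoNCE and entropy regularization as objectives but gives no formulas here. As a rough illustration only, the sketch below shows the generic textbook form of a one-directional InfoNCE loss and a Shannon entropy term in plain Python. The function names (`info_nce`, `entropy`), the cosine similarity, the temperature value 0.07, and the toy embeddings are all assumptions made for this example; none of it is taken from the CAE-AV paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(captions, modality, temperature=0.07):
    """Mean InfoNCE loss; captions[i] and modality[i] are the positive pair.

    Hypothetical helper: every other modality[j] in the batch acts as a negative.
    """
    losses = []
    for i, caption in enumerate(captions):
        logits = [cosine(caption, m) / temperature for m in modality]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def entropy(weights):
    """Shannon entropy of a normalized weight vector (e.g. selection weights)."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Toy embeddings (assumed): two captions and two audio or visual features.
captions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(captions, aligned)
shuffled_loss = info_nce(captions, shuffled)
```

In this toy setup the loss is small when each caption sits next to its own feature and large when the pairs are swapped, which is the behavior a contrastive alignment term is meant to reward. How such terms are weighted, and whether the entropy term is minimized or maximized during token selection, is not stated in the abstract and is left open here.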

Related papers

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing [26.317163478761916]
Weakly-supervised audio-visual video parsing seeks to detect audible, visible, and audio-visual events without temporal annotations. We propose an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks. We also propose a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs.
arXiv Detail & Related papers (2025-09-17T15:38:05Z)
AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition [2.4842074869626396]
We introduce a new AVSR framework termed AD-AVSR based on bidirectional modality enhancement. Specifically, we first introduce the audio dual-stream encoding strategy to enrich audio representations from multiple perspectives. We adopt a threshold-based selection mechanism to filter out irrelevant or weakly correlated audio-visual pairs.
arXiv Detail & Related papers (2025-08-11T04:23:08Z)
AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z)
Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z)
Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning [37.17910848101769]
Current vision-guided audio captioning systems fail to address audiovisual misalignment in real-world scenarios. We present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. We also develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs.
arXiv Detail & Related papers (2025-05-28T07:08:17Z)
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment [76.32508013503653]
We propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We tackle the mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. We improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens.
arXiv Detail & Related papers (2025-05-02T12:59:58Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges. This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations. We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA [20.155816093525374]
Existing AVQA methods suffer from two major shortcomings. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4%.
arXiv Detail & Related papers (2023-10-25T16:40:09Z)
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition [21.477900473255264]
We propose a cross-modal global interaction and local alignment (GILA) approach for audio-visual speech recognition (AVSR). Specifically, we design a global interaction model to capture the A-V complementary relationship on modality level, as well as a local alignment approach to model the A-V temporal consistency on frame level. Our GILA outperforms the supervised learning state-of-the-art on public benchmarks LRS3 and LRS2.
arXiv Detail & Related papers (2023-05-16T06:41:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.