Conditional Flow Matching for Visually-Guided Acoustic Highlighting
- URL: http://arxiv.org/abs/2602.03762v2
- Date: Wed, 04 Feb 2026 08:53:22 GMT
- Title: Conditional Flow Matching for Visually-Guided Acoustic Highlighting
- Authors: Hugo Malard, Gael Le Lan, Daniel Wong, David Lou Alon, Yi-Chiao Wu, Sanjeel Parekh,
- Abstract summary: Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing. We introduce a Conditional Flow Matching (CFM) framework to address this limitation.
- Score: 10.241097929338366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
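The abstract's two training signals, the flow matching regression and the rollout loss on the final integration step, can be illustrated with a minimal pure-Python sketch. Everything here is an assumption for illustration rather than the authors' implementation: the linear interpolation path, the Euler integrator, the mean-squared-error penalties, and the names `cfm_loss`, `euler_rollout`, `rollout_loss`, and `oracle_field` are all hypothetical, and `cond` merely stands in for the fused audio-visual conditioning.

```python
# Hypothetical sketch of a conditional flow matching (CFM) loss plus a
# rollout loss. Assumes a linear path x_t = (1 - t) * x0 + t * x1, which is
# a common choice but is NOT confirmed by the abstract.

def interpolate(x0, x1, t):
    """Point on the assumed straight-line path between x0 and x1 at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the assumed linear path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def cfm_loss(field, x0, x1, cond, t):
    """Regress the predicted vector field onto the target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    return mse(field(x_t, t, cond), target_velocity(x0, x1))

def euler_rollout(field, x0, cond, steps):
    """Integrate the learned field from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = field(x, k * dt, cond)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

def rollout_loss(field, x0, x1, cond, steps):
    """Penalize drift of the final integrated state away from the target."""
    return mse(euler_rollout(field, x0, cond, steps), x1)

# Toy check: an "oracle" field that simply returns the conditioning vector,
# which we set to the true velocity, and a biased field that drifts.
def oracle_field(x, t, cond):
    return list(cond)

def biased_field(x, t, cond):
    return [c + 0.1 for c in cond]

x0, x1 = [0.0, 1.0], [1.0, -1.0]
cond = target_velocity(x0, x1)
oracle_rollout = rollout_loss(oracle_field, x0, x1, cond, steps=4)
biased_rollout = rollout_loss(biased_field, x0, x1, cond, steps=4)
```

In this toy setting the oracle field yields zero rollout loss, while the biased field ends away from the target: a constant per-step error accumulates over the integration, which is the kind of drift a final-step penalty is meant to expose.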
Related papers
- Self-diffusion for Solving Inverse Problems [3.8870795921263728]
We propose self-diffusion, a novel framework for solving inverse problems without relying on pretrained generative models. Self-diffusion exploits the spectral bias of neural networks and modulates it through a scheduled noise process. We demonstrate the effectiveness of our approach on a variety of linear inverse problems, showing that self-diffusion achieves competitive or superior performance compared to other methods.
arXiv Detail & Related papers (2025-10-24T12:57:22Z)
- Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance [54.88271057438763]
Noise Awareness Guidance (NAG) is a correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.
arXiv Detail & Related papers (2025-10-14T13:31:34Z)
- High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling [65.02357548201188]
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information.
arXiv Detail & Related papers (2025-09-26T08:46:00Z)
- StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation [91.45910771331741]
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing.
arXiv Detail & Related papers (2025-08-11T17:58:24Z)
- BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models [62.38713281234756]
Binaural rendering aims to synthesize audio that mimics natural hearing from a mono audio signal. Many methods have been proposed to solve this problem, but they struggle with rendering quality and streamable inference. We propose BinauralFlow, a flow matching based streaming binaural speech synthesis framework.
arXiv Detail & Related papers (2025-05-28T20:59:15Z)
- Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning [37.17910848101769]
Current vision-guided audio captioning systems fail to address audiovisual misalignment in real-world scenarios. We present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. We also develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs.
arXiv Detail & Related papers (2025-05-28T07:08:17Z)
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- Unveiling and Mitigating Bias in Audio Visual Segmentation [9.427676046134374]
Community researchers have developed a range of advanced audio-visual segmentation models to improve the quality of sounding objects' masks.
While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic.
We attribute this to real-world inherent preferences and distributions being a simpler learning signal than complex audio-visual grounding.
arXiv Detail & Related papers (2024-07-23T16:55:04Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.