DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion
- URL: http://arxiv.org/abs/2504.21366v1
- Date: Wed, 30 Apr 2025 06:55:24 GMT
- Title: DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion
- Authors: Yinfeng Yu, Shiyu Sun
- Abstract summary: Current Audio-Visual Source Separation methods primarily adopt two design strategies. The first strategy involves fusing audio and visual features at the bottleneck layer of the encoder, followed by processing the fused features through the decoder. The second strategy avoids direct fusion and instead relies on the decoder to handle the interaction between audio and visual features. This paper proposes a dynamic fusion method based on a gating mechanism that dynamically adjusts the modality fusion degree.
- Score: 1.292190360867547
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current Audio-Visual Source Separation methods primarily adopt two design strategies. The first strategy involves fusing audio and visual features at the bottleneck layer of the encoder, followed by processing the fused features through the decoder. However, when there is a significant disparity between the two modalities, this approach may lead to the loss of critical information. The second strategy avoids direct fusion and instead relies on the decoder to handle the interaction between audio and visual features. Nonetheless, if the encoder fails to integrate information across modalities adequately, the decoder may be unable to effectively capture the complex relationships between them. To address these issues, this paper proposes a dynamic fusion method based on a gating mechanism that dynamically adjusts the modality fusion degree. This approach mitigates the limitations of solely relying on the decoder and facilitates efficient collaboration between audio and visual features. Additionally, an audio attention module is introduced to enhance the expressive capacity of audio features, thereby further improving model performance. Experimental results demonstrate that our method achieves significant performance improvements on two benchmark datasets, validating its effectiveness and advantages in Audio-Visual Source Separation tasks.
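To make the abstract's two ingredients concrete, here is a minimal PyTorch sketch of one plausible reading of them: a sigmoid gate that predicts a per-channel fusion degree between the modalities, and a self-attention block that re-weights audio features before fusion. The module names, shapes, and the convex-blend formulation are illustrative assumptions, not DGFNet's actual implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sigmoid gate that decides, per channel, how strongly to mix
    visual features into the audio stream (illustrative sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, dim) bottleneck embeddings
        g = self.gate(torch.cat([audio, visual], dim=-1))  # fusion degree in [0, 1]
        return g * visual + (1.0 - g) * audio              # dynamic convex blend

class AudioAttention(nn.Module):
    """Self-attention over audio time steps, meant to strengthen the
    audio representation before fusion (illustrative sketch)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_seq: torch.Tensor) -> torch.Tensor:
        # audio_seq: (batch, time, dim)
        out, _ = self.attn(audio_seq, audio_seq, audio_seq)
        return out + audio_seq  # residual connection

# Usage: attend over the audio sequence, pool, then blend with visual features.
dim = 512
attend, fuse = AudioAttention(dim), GatedFusion(dim)
audio_seq, visual = torch.randn(4, 100, dim), torch.randn(4, dim)
audio = attend(audio_seq).mean(dim=1)  # pool attended audio to (4, dim)
fused = fuse(audio, visual)            # (4, dim)
```

Because the gate output lives in [0, 1] per channel, the model can fall back to near-pure audio when the visual stream is uninformative, which is one way to read the abstract's claim about mitigating modality disparity.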
Related papers
- DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction [5.13730975608994]
Audio-visual saliency prediction aims to mimic human visual attention by identifying salient regions in videos. We propose Dynamic Token Fusion Saliency (DTFSal), a novel audio-visual saliency prediction framework designed to balance accuracy with computational efficiency.
arXiv Detail & Related papers (2025-04-14T10:17:25Z)
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap [38.5017989456818]
DiffGAP is a novel approach incorporating a lightweight generative module within the contrastive space. Our experimental results on the VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks.
arXiv Detail & Related papers (2025-03-15T13:24:09Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
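As a rough illustration of the dropout-induced trade-off described above, the following is a hedged sketch of modality dropout: during training, the video stream is zeroed out for some samples so the model learns to cope with missing frames. The function name and rate are hypothetical, and the MDA-KD distillation itself is not reproduced here.

```python
import torch

def modality_dropout(video_feats: torch.Tensor, p: float = 0.3,
                     training: bool = True) -> torch.Tensor:
    """Randomly drop the entire video stream for a fraction of the batch
    (hypothetical sketch of the technique the summary describes)."""
    if not training or p <= 0.0:
        return video_feats
    # Per-sample keep mask, broadcast over the remaining (time, dim) axes.
    keep = (torch.rand(video_feats.shape[0], device=video_feats.device) > p)
    return video_feats * keep.view(-1, *([1] * (video_feats.dim() - 1))).float()
```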
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
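The "audio-guided" fusion idea can be pictured as cross-attention in which audio features form the queries and visual features the keys and values. The sketch below is a plausible reading under that assumption, not the CMFE authors' code.

```python
import torch
import torch.nn as nn

class AudioGuidedCrossAttention(nn.Module):
    """Audio queries attend over visual keys/values (illustrative sketch
    of one audio-guided cross-modal attention layer)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_a, dim) queries; visual: (batch, T_v, dim) keys/values
        attended, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + attended)  # residual + norm, transformer-style
```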
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation [22.28510611697998]
We propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task.
Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features.
arXiv Detail & Related papers (2023-07-25T03:59:04Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
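The co-prediction strategy can be sketched as two representations of the same sound source being pulled toward each other with a symmetric objective. The loss below is one hedged guess at such a setup; the name, the stop-gradient choice, and the use of MSE are assumptions, not AVPC's actual loss.

```python
import torch
import torch.nn.functional as F

def co_prediction_loss(rep_a: torch.Tensor, rep_b: torch.Tensor) -> torch.Tensor:
    """Symmetric objective pulling two audio-visual representations of the
    same source toward each other (hypothetical sketch)."""
    # Stop-gradient on the target side of each direction, as in many
    # self-supervised co-prediction setups.
    loss_ab = F.mse_loss(rep_a, rep_b.detach())
    loss_ba = F.mse_loss(rep_b, rep_a.detach())
    return 0.5 * (loss_ab + loss_ba)
```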
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Automated Audio Captioning via Fusion of Low- and High-Dimensional Features [48.62190893209622]
Existing AAC methods use only the high-dimensional representation of the PANNs as the input to the decoder.
A new encoder-decoder framework, the Low- and High-Dimensional Feature Fusion (LHDFF) model, is proposed for AAC.
LHDFF achieves the best performance on the Clotho and AudioCaps datasets compared with other existing models.
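One plausible way to read "low- and high-dimensional feature fusion" is projecting an early (low-level) and a late (high-level) encoder feature map to a common width and mixing them before the caption decoder. The sketch below assumes exactly that; the module name, shapes, and concatenation scheme are illustrative, not the LHDFF code.

```python
import torch
import torch.nn as nn

class LowHighFusion(nn.Module):
    """Project low-level and high-level encoder features to a shared
    width and fuse them by concatenation + linear mixing (illustrative)."""
    def __init__(self, low_dim: int, high_dim: int, out_dim: int):
        super().__init__()
        self.proj_low = nn.Linear(low_dim, out_dim)
        self.proj_high = nn.Linear(high_dim, out_dim)
        self.mix = nn.Linear(2 * out_dim, out_dim)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low: (batch, T, low_dim), high: (batch, T, high_dim)
        fused = torch.cat([self.proj_low(low), self.proj_high(high)], dim=-1)
        return self.mix(fused)  # (batch, T, out_dim) fed to the caption decoder
```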
arXiv Detail & Related papers (2022-10-10T22:39:41Z)
- Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition [27.742673824969238]
Experiments on the LRS3-TED dataset demonstrate that the proposed method increases the recognition rate by 0.55%, 4.51%, and 4.61% on average under the clean, seen-noise, and unseen-noise conditions, respectively.
arXiv Detail & Related papers (2020-08-06T14:39:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.