Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding
- URL: http://arxiv.org/abs/2306.10684v1
- Date: Mon, 19 Jun 2023 03:10:57 GMT
- Title: Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding
- Authors: Zengjie Song and Zhaoxiang Zhang
- Abstract summary: Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-efficient and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
- Score: 57.08832099075793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The framework of visually-guided sound source separation generally consists
of three parts: visual feature extraction, multimodal feature fusion, and sound
signal processing. An ongoing trend in this field has been to tailor the visual feature extractor for informative visual guidance and to devise a separate module for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may yield suboptimal performance, since jointly optimizing and harmonizing the various model components is challenging. By contrast, this paper presents a novel approach, dubbed audio-visual predictive coding (AVPC), to tackle this task in a parameter-efficient and more effective manner. The
network of AVPC features a simple ResNet-based video analysis network for
deriving semantic visual features, and a predictive coding-based sound
separation network that can extract audio features, fuse multimodal
information, and predict sound separation masks in the same architecture. By
iteratively minimizing the prediction error between features, AVPC integrates
audio and visual information recursively, leading to progressively improved
performance. In addition, we develop a valid self-supervised learning strategy
for AVPC via co-predicting two audio-visual representations of the same sound
source. Extensive evaluations demonstrate that AVPC outperforms several
baselines in separating musical instrument sounds, while reducing the model
size significantly. Code is available at:
https://github.com/zjsong/Audio-Visual-Predictive-Coding.
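To make the fusion mechanism described above concrete, here is a minimal, hypothetical sketch of a predictive-coding style update: a shared step predicts the audio features from the current state and feeds the prediction error back to refine the fused representation. The module names, linear layers, iteration count, and update rule are illustrative assumptions, not the authors' architecture; the repository linked above contains the actual implementation.

```python
# Hypothetical predictive-coding fusion sketch (not the AVPC implementation).
import torch
import torch.nn as nn


class PCFusionStep(nn.Module):
    """One predictive-coding step: predict the audio features from the current
    state, measure the prediction error, and use it to refine the state."""

    def __init__(self, dim: int):
        super().__init__()
        self.predict = nn.Linear(dim, dim)  # top-down prediction (assumed form)
        self.correct = nn.Linear(dim, dim)  # bottom-up error feedback (assumed form)

    def forward(self, state: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        pred_error = audio_feat - self.predict(state)  # prediction error between features
        return state + self.correct(pred_error)        # recursive refinement of the state


def avpc_like_fusion(visual_feat: torch.Tensor,
                     audio_feat: torch.Tensor,
                     step: PCFusionStep,
                     num_iters: int = 4) -> torch.Tensor:
    """Initialize the state from semantic visual features and refine it by
    repeatedly reducing the prediction error against the audio features; a
    separation mask would then be decoded from the fused state."""
    state = visual_feat
    for _ in range(num_iters):
        state = step(state, audio_feat)
    return state


# Toy usage: batch of 2, 512-dimensional visual and audio feature vectors.
fused = avpc_like_fusion(torch.randn(2, 512), torch.randn(2, 512), PCFusionStep(512))
```

In a scheme like this, every additional iteration reuses the same parameters, which is one way the recursive audio-visual integration described in the abstract can remain parameter-efficient.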
Related papers
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
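As a rough illustration of the audio-guided cross-modal fusion idea mentioned in the entry above, the sketch below lets audio features attend to visual features with standard multi-head cross-attention. The layer sizes, residual wiring, and normalization are assumptions for illustration rather than details of the CMFE in that paper.

```python
# Hypothetical audio-guided cross-modal fusion layer (illustrative assumptions).
import torch
import torch.nn as nn


class CrossModalFusionLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (batch, audio_frames, dim); visual: (batch, video_frames, dim).
        # Audio features act as queries; visual features supply keys and values.
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual connection around the attention


# Toy usage: 2 clips, 50 audio frames and 25 video frames, 256-dim features.
out = CrossModalFusionLayer()(torch.randn(2, 50, 256), torch.randn(2, 25, 256))
```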
- Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality.
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
arXiv Detail & Related papers (2023-05-31T20:09:50Z)
- CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective [30.995357472421404]
Video Saliency Prediction (VSP) imitates the selective attention mechanism of the human brain.
Most VSP methods exploit semantic correlation between vision and audio modalities but ignore the negative effects due to the temporal inconsistency of audio-visual intrinsics.
Inspired by the biological inconsistency-correction within multi-sensory information, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed.
arXiv Detail & Related papers (2023-03-11T09:29:57Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
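A hedged sketch of the iterative pruning-plus-fine-tuning loop referenced in the LTH-IF entry above: each round removes a fraction of the smallest-magnitude weights and then fine-tunes the network to recover accuracy. The pruning ratio, number of rounds, and the fine-tuning routine are illustrative assumptions, not the paper's settings.

```python
# Hypothetical iterative magnitude pruning with fine-tuning (illustrative only).
import torch.nn as nn
import torch.nn.utils.prune as prune


def iterative_prune_and_finetune(model: nn.Module,
                                 finetune_fn,          # callable that fine-tunes the model in place (hypothetical)
                                 rounds: int = 5,
                                 amount_per_round: float = 0.2) -> nn.Module:
    """Alternate between pruning a fraction of the weights and fine-tuning."""
    for _ in range(rounds):
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                # Prune a fraction of the smallest-magnitude weights in this layer.
                prune.l1_unstructured(module, name="weight", amount=amount_per_round)
        finetune_fn(model)  # recover accuracy before the next pruning round
    return model


# Toy usage; the no-op lambda stands in for a real fine-tuning loop.
pruned = iterative_prune_and_finetune(
    nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10)),
    finetune_fn=lambda m: None, rounds=2)
```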
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
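The hybrid CTC/Attention objective named in the last entry is commonly realized as a weighted sum of a CTC loss and an attention-decoder cross-entropy; the sketch below shows one such formulation. The weighting factor, tensor shapes, and padding convention are illustrative assumptions, not details taken from that paper.

```python
# Hypothetical hybrid CTC/attention objective (illustrative assumptions).
import torch
import torch.nn.functional as F


def hybrid_ctc_attention_loss(ctc_log_probs: torch.Tensor,   # (T, B, vocab) log-probabilities
                              attn_logits: torch.Tensor,      # (B, L, vocab) decoder logits
                              targets: torch.Tensor,          # (B, L) label ids, padded with pad_id
                              input_lengths: torch.Tensor,    # (B,) encoder output lengths
                              target_lengths: torch.Tensor,   # (B,) label lengths
                              alpha: float = 0.3,
                              pad_id: int = 0) -> torch.Tensor:
    """Weighted combination: alpha * CTC + (1 - alpha) * attention cross-entropy."""
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=pad_id, zero_infinity=True)
    # In practice the attention-decoder targets are usually shifted and include
    # <sos>/<eos> tokens; a single padded target tensor is reused here for brevity.
    att = F.cross_entropy(attn_logits.reshape(-1, attn_logits.size(-1)),
                          targets.reshape(-1), ignore_index=pad_id)
    return alpha * ctc + (1.0 - alpha) * att


# Toy usage: 40 encoder frames, batch of 2, 8 labels, 30-symbol vocabulary.
T, B, L, V = 40, 2, 8, 30
loss = hybrid_ctc_attention_loss(
    torch.randn(T, B, V).log_softmax(-1), torch.randn(B, L, V),
    torch.randint(1, V, (B, L)), torch.full((B,), T), torch.full((B,), L))
```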