Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
- URL: http://arxiv.org/abs/2407.10373v1
- Date: Mon, 15 Jul 2024 00:47:56 GMT
- Title: Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
- Authors: Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng
- Abstract summary: We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
- Score: 93.32354378820648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual acoustic matching (VAM) is pivotal for enhancing the immersive experience, and the task of dereverberation is effective in improving audio intelligibility. Existing methods treat each task independently, overlooking the inherent reciprocity between them. Moreover, these methods depend on paired training data, which is challenging to acquire, impeding the use of extensive unpaired data. In this paper, we introduce MVSD, a mutual learning framework based on diffusion models. MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks and overcome data scarcity. Furthermore, we employ diffusion models as foundational conditional converters to circumvent the training instability and over-smoothing drawbacks of conventional GAN architectures. Specifically, MVSD employs two converters: one for VAM, called the reverberator, and one for dereverberation, called the dereverberator. The dereverberator judges whether the reverberant audio generated by the reverberator sounds as though it was recorded in the conditioning visual scene, and vice versa. By forming a closed loop, these two converters can generate informative feedback signals to optimize the inverse tasks, even with easily acquired one-way unpaired data. Extensive experiments on two standard benchmarks, i.e., SoundSpaces-Speech and Acoustic AVSpeech, show that our framework improves the performance of both the reverberator and the dereverberator and better matches specified visual scenarios.
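A minimal, hypothetical sketch of the closed-loop feedback idea described above. The actual MVSD converters are conditional diffusion models; plain feed-forward networks stand in here so the mutual learning loop stays readable, and the module names, feature dimensions, and cycle-style L1 reconstruction loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Converter(nn.Module):
    """Stand-in for a visual-scene-conditioned audio converter.

    In MVSD this role is played by a conditional diffusion model; a small
    MLP is used here only to keep the feedback loop itself visible.
    """
    def __init__(self, audio_dim=512, scene_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + scene_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, audio_dim),
        )

    def forward(self, audio_feat, scene_feat):
        return self.net(torch.cat([audio_feat, scene_feat], dim=-1))

reverberator = Converter()    # anechoic -> reverberant, conditioned on a scene
dereverberator = Converter()  # reverberant -> anechoic, conditioned on a scene
opt = torch.optim.Adam(
    list(reverberator.parameters()) + list(dereverberator.parameters()), lr=1e-4)

def mutual_learning_step(dry, scene_of_dry, wet, scene_of_wet):
    """One step on one-way unpaired data: each converter's output is judged
    by its inverse, and the reconstruction error is the feedback signal."""
    # Loop 1: dry -> reverberate -> dereverberate -> should recover dry.
    cycle_dry = dereverberator(reverberator(dry, scene_of_dry), scene_of_dry)
    # Loop 2: wet -> dereverberate -> reverberate -> should recover wet.
    cycle_wet = reverberator(dereverberator(wet, scene_of_wet), scene_of_wet)
    loss = F.l1_loss(cycle_dry, dry) + F.l1_loss(cycle_wet, wet)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors standing in for audio and scene embeddings.
dry, wet = torch.randn(8, 512), torch.randn(8, 512)
scene_a, scene_b = torch.randn(8, 128), torch.randn(8, 128)
print(mutual_learning_step(dry, scene_a, wet, scene_b))
```

The point the sketch tries to capture is that each converter is supervised by its inverse rather than by paired targets, which is what allows the closed loop to run on easily acquired unpaired data.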
Related papers
- Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training [102.18680666349806]
We propose a speed co-augmentation method that randomly changes the playback speeds of both audio and video data (a toy sketch of this transform appears after the list below).
Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
arXiv Detail & Related papers (2023-09-25T08:22:30Z)
- DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection [38.12212015133935]
A novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process.
Experiments conducted on five widely-used benchmarks demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
arXiv Detail & Related papers (2023-08-29T08:20:23Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a word error rate (WER) of 26%.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach while using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast to prior approaches, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition [10.74796391075403]
We present a variant of AV Align where the recurrent Long Short-term Memory (LSTM) block is replaced by the more recently proposed Transformer block.
We find that Transformers also learn cross-modal monotonic alignments, but suffer from the same visual convergence problems as the LSTM model.
arXiv Detail & Related papers (2020-05-19T09:06:39Z)
- Audio-Visual Decision Fusion for WFST-based and seq2seq Models [3.2771898634434997]
Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER).
We propose novel methods to fuse information from the audio and visual modalities at inference time; a toy fusion sketch follows this list.
We show that our methods give significant improvements over acoustic-only WER.
arXiv Detail & Related papers (2020-01-29T13:45:08Z)
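As noted in the Audio-Visual Decision Fusion entry that closes the list, the fusion happens at inference time. Here is a toy, hypothetical illustration of that idea: the paper itself fuses WFST lattices and seq2seq beams, while this sketch shows only a weighted combination of the log-probabilities that two single-modality recognizers assign to shared candidate transcripts, with all names and numbers invented.

```python
import math

def fuse(audio_scores: dict, visual_scores: dict, lam: float = 0.8) -> str:
    """Scores are log-probabilities per candidate transcript; lam weights audio."""
    candidates = audio_scores.keys() & visual_scores.keys()
    return max(candidates,
               key=lambda h: lam * audio_scores[h] + (1 - lam) * visual_scores[h])

# Made-up hypotheses: noisy audio slightly prefers the wrong transcript,
# but the visual (lip-reading) evidence rescues the correct one.
audio = {"set the alarm": math.log(0.4), "get the alarm": math.log(0.5)}
visual = {"set the alarm": math.log(0.7), "get the alarm": math.log(0.1)}
print(fuse(audio, visual))  # -> "set the alarm"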
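And as promised in the Speed Co-Augmentation entry earlier in the list, here is a toy sketch of the transform it names: draw random playback speeds and resample the audio and video streams of a clip accordingly. The speed set, the independent per-modality draws, and the linear-interpolation resampling are all illustrative assumptions; the paper's actual augmentation and its contrastive training objective are not reproduced.

```python
import torch
import torch.nn.functional as F

def speed_co_augment(audio, video, speeds=(0.5, 1.0, 1.5, 2.0)):
    """audio: (channels, samples); video: (frames, C, H, W).

    Draws a playback speed per modality (an assumption; a shared speed is
    equally easy) and resamples each stream to the new rate.
    """
    s_a = speeds[torch.randint(len(speeds), (1,)).item()]
    s_v = speeds[torch.randint(len(speeds), (1,)).item()]
    # Audio: playing back s_a times faster keeps 1/s_a of the samples.
    n = max(1, int(audio.shape[-1] / s_a))
    fast_audio = F.interpolate(audio.unsqueeze(0), size=n,
                               mode="linear", align_corners=False).squeeze(0)
    # Video: keep evenly spaced frames at the new rate.
    m = max(1, int(video.shape[0] / s_v))
    idx = torch.linspace(0, video.shape[0] - 1, m).long()
    return fast_audio, video[idx], (s_a, s_v)

audio = torch.randn(1, 16000)         # 1 s of 16 kHz mono audio
video = torch.randn(25, 3, 224, 224)  # 25 video frames
a, v, s = speed_co_augment(audio, video)
print(s, a.shape, v.shape)
```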