Contrastive Conditional Latent Diffusion for Audio-visual Segmentation
- URL: http://arxiv.org/abs/2307.16579v1
- Date: Mon, 31 Jul 2023 11:29:50 GMT
- Title: Contrastive Conditional Latent Diffusion for Audio-visual Segmentation
- Authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, Yuchao Dai
- Abstract summary: We introduce a latent diffusion model to achieve semantic-correlated representation learning.
We argue it is essential to ensure that the conditional variable contributes to model output.
- Score: 37.83055692562661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a latent diffusion model with contrastive learning for
audio-visual segmentation (AVS) to extensively explore the contribution of
audio. We interpret AVS as a conditional generation task, where audio is
defined as the conditional variable for segmenting the sound producer(s). Under
this interpretation, it is essential to model the correlation between the audio
and the final segmentation map so that the audio genuinely contributes. We
introduce a latent diffusion model to our framework to achieve
semantic-correlated representation learning. Specifically, our diffusion model
learns the conditional generation process of the ground-truth segmentation map,
leading to ground-truth-aware inference when the denoising process is performed
at test time. For a conditional diffusion model, we argue it is essential
to ensure that the conditional variable contributes to model output. We then
introduce contrastive learning to our framework to learn audio-visual
correspondence, which is shown to be consistent with maximizing the mutual
information between the model prediction and the audio data. In this way, our
latent diffusion model via contrastive learning explicitly maximizes the
contribution of audio for AVS. Experimental results on the benchmark dataset
verify the effectiveness of our solution. Code and results are available on our
project page: https://github.com/OpenNLPLab/DiffusionAVS.
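To make the two training signals described above concrete, here is a minimal, self-contained PyTorch sketch of one training step that combines (i) a conditional denoising loss on a ground-truth mask latent, conditioned on audio and visual features, with (ii) an InfoNCE contrastive term between the audio embedding and a prediction-derived embedding. It is an illustrative approximation under assumed shapes and placeholder encoders, not the authors' released implementation (see the project page above for that); the module names, dimensions, noise schedule, and the simple slice used as a projection are all hypothetical.

```python
# Minimal sketch (PyTorch): one training step of a conditional latent diffusion
# model for AVS with an InfoNCE contrastive term. Shapes, modules, and the noise
# schedule are illustrative placeholders, not the authors' implementation.
import torch
import torch.nn.functional as F
from torch import nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to the mask latent, conditioned on audio+visual features."""
    def __init__(self, latent_dim=64, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, t, cond):
        # t is the normalized diffusion timestep, appended as an extra feature.
        return self.net(torch.cat([z_t, cond, t[:, None]], dim=-1))

def info_nce(audio_emb, pred_emb, temperature=0.07):
    """InfoNCE over a batch: matched (audio, prediction) pairs are positives,
    all other pairings are negatives. Minimizing this loss maximizes a standard
    lower bound on the mutual information between the two embeddings."""
    a = F.normalize(audio_emb, dim=-1)
    p = F.normalize(pred_emb, dim=-1)
    logits = a @ p.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0))             # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(denoiser, z0, audio_emb, visual_emb, betas, lam=0.1):
    """z0: clean ground-truth mask latent (B, 64); audio_emb/visual_emb: (B, 64) each here."""
    B = z0.size(0)
    T = betas.numel()
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    # 1) Sample a timestep and diffuse the ground-truth mask latent.
    t = torch.randint(0, T, (B,))
    a_bar = alphas_bar[t][:, None]
    eps = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps

    # 2) Conditional denoising loss: the condition carries audio + visual features,
    #    so the generation process of the mask latent is tied to the audio.
    cond = torch.cat([audio_emb, visual_emb], dim=-1)
    eps_hat = denoiser(z_t, t.float() / T, cond)
    diffusion_loss = F.mse_loss(eps_hat, eps)

    # 3) Contrastive loss between the audio embedding and a prediction-derived
    #    embedding (here, the denoiser's estimate of z0, sliced as a toy projection).
    z0_hat = (z_t - (1.0 - a_bar).sqrt() * eps_hat) / a_bar.sqrt().clamp_min(1e-5)
    pred_emb = z0_hat[:, : audio_emb.size(-1)]
    contrastive_loss = info_nce(audio_emb, pred_emb)

    return diffusion_loss + lam * contrastive_loss

# Toy usage with random tensors standing in for real encoder outputs.
denoiser = ConditionalDenoiser()
betas = torch.linspace(1e-4, 2e-2, 1000)
loss = training_step(denoiser, torch.randn(8, 64), torch.randn(8, 64),
                     torch.randn(8, 64), betas)
loss.backward()
```

The InfoNCE term is what links the contrastive objective to mutual information: minimizing it maximizes a standard lower bound on the mutual information between the prediction-derived embedding and the audio, which is the sense in which the contribution of audio to the output is made explicit.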
Related papers
- Do Audio-Visual Segmentation Models Truly Segment Sounding Objects? [38.98706069359109]
We introduce AVSBench-Robust, a benchmark incorporating diverse negative audio scenarios including silence, ambient noise, and off-screen sounds.
Our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-perfect false positive rates.
arXiv Detail & Related papers (2025-02-01T07:40:29Z)
- Diffusion-based Unsupervised Audio-visual Speech Enhancement [26.937216751657697]
This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach.
It combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model.
Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised-generative AVSE method.
arXiv Detail & Related papers (2024-10-04T12:22:54Z)
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA).
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
- Audio-visual speech enhancement with a deep Kalman filter generative model [0.0]
We present an audiovisual deep Kalman filter (AV-DKF) generative model which assumes a first-order Markov chain model for the latent variables.
We develop an efficient inference methodology to estimate speech signals at test time.
arXiv Detail & Related papers (2022-11-02T09:50:08Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
The introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech.
To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech.
Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.