Seeing Through the Conversation: Audio-Visual Speech Separation based on
Diffusion Model
- URL: http://arxiv.org/abs/2310.19581v1
- Date: Mon, 30 Oct 2023 14:39:34 GMT
- Title: Seeing Through the Conversation: Audio-Visual Speech Separation based on
Diffusion Model
- Authors: Suyeon Lee, Chaeyoung Jung, Youngjoon Jang, Jaehun Kim, Joon Son Chung
- Abstract summary: We propose AVDiffuSS, an audio-visual speech separation model based on a diffusion mechanism known for its capability in generating natural samples.
For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism.
We demonstrate that the proposed framework achieves state-of-the-art results on two benchmarks, VoxCeleb2 and LRS3, producing speech with notably better naturalness.
- Score: 13.96610874947899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this work is to extract the target speaker's voice from a
mixture of voices using visual cues. Existing works on audio-visual speech
separation have demonstrated their performance with promising intelligibility,
but maintaining naturalness remains a challenge. To address this issue, we
propose AVDiffuSS, an audio-visual speech separation model based on a diffusion
mechanism known for its capability in generating natural samples. For an
effective fusion of the two modalities for diffusion, we also propose a
cross-attention-based feature fusion mechanism. This mechanism is specifically
tailored for the speech domain to integrate the phonetic information from
audio-visual correspondence in speech generation. In this way, the fusion
process maintains the high temporal resolution of the features, without
excessive computational requirements. We demonstrate that the proposed
framework achieves state-of-the-art results on two benchmarks, VoxCeleb2 and
LRS3, producing speech with notably better naturalness.
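The abstract does not include code, so below is a minimal, hypothetical sketch of cross-attention feature fusion in which audio features act as queries and visual (lip) features act as keys and values, so the fused output keeps the audio stream's temporal resolution. All names, dimensions, and frame counts are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-attention audio-visual fusion (not the authors' code).
# Audio features are the queries; visual (lip) features are the keys/values, so the
# fused output retains the audio stream's temporal resolution.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, audio_dim=256, visual_dim=512, n_heads=4):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, audio_dim)  # match feature dimensions
        self.attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, T_audio, audio_dim)   e.g. per-STFT-frame features
        # visual_feats: (batch, T_video, visual_dim)  e.g. per-video-frame lip features
        v = self.visual_proj(visual_feats)
        fused, _ = self.attn(query=audio_feats, key=v, value=v)
        return self.norm(audio_feats + fused)  # residual connection

fusion = CrossModalFusion()
audio = torch.randn(2, 400, 256)   # dummy: 400 audio frames
video = torch.randn(2, 100, 512)   # dummy: 100 video frames (lower frame rate)
print(fusion(audio, video).shape)  # torch.Size([2, 400, 256])
```

Because the video stream runs at a much lower frame rate than the audio features, keeping audio on the query side conditions every audio frame on the visual evidence without resampling either stream.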
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
- DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z)
- Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, intended as variations in rhythm, pitch or accent in speech, is extracted through a specialized encoder.
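A minimal sketch of this combination, assuming the two embeddings have already been extracted (the extractors, dimensions, and classifier below are illustrative stand-ins, not the systems used in the paper):

```python
# Hypothetical sketch: fuse a speaker embedding with a prosody embedding for a
# binary real/synthetic decision (illustrative only, not the paper's pipeline).
import torch
import torch.nn as nn

class SyntheticSpeechDetector(nn.Module):
    def __init__(self, spk_dim=192, prosody_dim=128, hidden=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(spk_dim + prosody_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "synthetic"
        )

    def forward(self, spk_emb, prosody_emb):
        # spk_emb:     (batch, spk_dim)     from an ASV embedding extractor
        # prosody_emb: (batch, prosody_dim) from an encoder over pitch/energy contours
        return self.classifier(torch.cat([spk_emb, prosody_emb], dim=-1))

detector = SyntheticSpeechDetector()
logit = detector(torch.randn(4, 192), torch.randn(4, 128))
print(torch.sigmoid(logit).shape)  # torch.Size([4, 1]) probability of synthetic speech
```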
arXiv Detail & Related papers (2022-10-31T11:03:03Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
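A minimal sketch of dynamic stream weighting, assuming per-region, per-frame audio and video localization evidence plus reliability features for each stream (all shapes and names are illustrative assumptions, not the paper's implementation):

```python
# Hypothetical sketch of dynamic stream weights: a small network predicts, for each
# region and time step, how much to trust the audio vs. the video evidence.
import torch
import torch.nn as nn

class DynamicStreamFusion(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(2 * feat_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid()
        )

    def forward(self, audio_scores, video_scores, audio_feats, video_feats):
        # *_scores: (batch, T, regions)           speaker-presence evidence per region
        # *_feats:  (batch, T, regions, feat_dim) features used to judge reliability
        w = self.weight_net(torch.cat([audio_feats, video_feats], dim=-1)).squeeze(-1)
        return w * audio_scores + (1.0 - w) * video_scores  # fused evidence

fusion = DynamicStreamFusion()
B, T, R, D = 2, 50, 8, 64
fused = fusion(torch.randn(B, T, R), torch.randn(B, T, R),
               torch.randn(B, T, R, D), torch.randn(B, T, R, D))
print(fused.shape)  # torch.Size([2, 50, 8])
```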
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
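The two-stream architecture in the last entry above can be sketched as a shared low-level front-end feeding a per-clip identity head and a per-frame content head; the layers, sizes, and input features below are illustrative assumptions, not the authors' model.

```python
# Hypothetical two-stream sketch: shared low-level features, then separate heads
# intended for speaker identity (one vector per clip) and linguistic content
# (one vector per frame). Illustrative only, not the paper's architecture.
import torch
import torch.nn as nn

class TwoStreamSpeechEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.shared = nn.Sequential(  # shared low-level feature extractor
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.identity_head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(dim, dim)
        )
        self.content_head = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, mel):
        # mel: (batch, n_mels, T) log-mel spectrogram
        h = self.shared(mel)
        return self.identity_head(h), self.content_head(h)  # (B, dim), (B, dim, T)

enc = TwoStreamSpeechEncoder()
identity, content = enc(torch.randn(2, 80, 300))
print(identity.shape, content.shape)  # torch.Size([2, 256]) torch.Size([2, 256, 300])
```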
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.