Related papers: Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

URL: http://arxiv.org/abs/2307.02909v1
Date: Thu, 6 Jul 2023 10:50:46 GMT
Title: Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Authors: Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu
Abstract summary: An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper. Video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end. Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset.
Score: 52.11964238935099
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end jointly fine-tuning using either the ASR cost function alone, or its interpolation with the speech enhancement loss. Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.

Related papers

Cocktail-Party Audio-Visual Speech Recognition [58.222892601847924]
This study introduces a novel audio-visual cocktail-party dataset designed to benchmark current AVSR systems.<n>We contribute a 1526-hour AVSR dataset comprising both talking-face and silent-face segments, enabling significant performance gains in cocktail-party environments.<n>Our approach reduces WER by 67% relative to the state-of-the-art, reducing WER from 119% to 39.2% in extreme noise, without relying on explicit segmentation cues.
arXiv Detail & Related papers (2025-06-02T19:07:51Z)
Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation. We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation. New network SE-Conformer can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement [40.29155338515071]
ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis. It achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model.
arXiv Detail & Related papers (2022-12-21T21:36:52Z)
Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
Continuous Speech Separation with Conformer [60.938212082732775]
We use transformer and conformer in lieu of recurrent neural networks in the separation system. We believe capturing global information with the self-attention based method is crucial for the speech separation.
arXiv Detail & Related papers (2020-08-13T09:36:05Z)
Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the lipreading sentence 2 dataset respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.