Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization
- URL: http://arxiv.org/abs/2011.00091v1
- Date: Fri, 30 Oct 2020 20:26:28 GMT
- Title: Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization
- Authors: Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Yong Xu, Shi-Xiong Zhang, Dong Yu
- Abstract summary: This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR).
In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance.
- Score: 73.62550438861942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a new paradigm for handling far-field multi-speaker data
in an end-to-end neural network manner, called directional automatic speech
recognition (D-ASR), which explicitly models source speaker locations. In
D-ASR, the azimuth angle of the sources with respect to the microphone array is
defined as a latent variable. This angle controls the quality of separation,
which in turn determines the ASR performance. All three functionalities of
D-ASR (localization, separation, and recognition) are connected as a single
differentiable neural network and trained solely based on ASR error
minimization objectives. The advantages of D-ASR over existing methods are
threefold: (1) it provides explicit speaker locations, (2) it improves the
explainability factor, and (3) it achieves better ASR performance as the
process is more streamlined. In addition, D-ASR does not require explicit
direction of arrival (DOA) supervision like existing data-driven localization
models, which makes it more appropriate for realistic data. For the case of two
source mixtures, D-ASR achieves an average DOA prediction error of less than
three degrees. It also outperforms a strong far-field multi-speaker end-to-end
system in both separation quality and ASR performance.
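To make the pipeline concrete, below is a minimal, single-source PyTorch sketch of the idea described above: a DOA subnetwork predicts a soft distribution over discretised azimuth angles (the latent variable), the angle estimate conditions a separation (masking) module, and the separated stream feeds an ASR encoder trained only with a CTC objective, so no DOA labels are required. Module names, layer sizes, and the feature pipeline are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of the D-ASR idea for a single source:
# localization -> angle-conditioned separation -> ASR, all differentiable,
# trained only with an ASR (CTC) loss and no DOA supervision.
import torch
import torch.nn as nn


class DirectionalASR(nn.Module):
    def __init__(self, n_feats=80, n_angles=72, n_tokens=500, hidden=256):
        super().__init__()
        # Latent DOA: a distribution over discretised azimuth bins (5-degree grid here).
        self.doa_net = nn.Sequential(
            nn.Linear(n_feats, hidden), nn.ReLU(), nn.Linear(hidden, n_angles)
        )
        # Separator conditioned on a (soft, differentiable) angle embedding.
        self.angle_embed = nn.Linear(n_angles, hidden)
        self.separator = nn.GRU(n_feats + hidden, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_feats)
        # ASR encoder and CTC head over the "separated" stream.
        self.encoder = nn.GRU(n_feats, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, n_tokens)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, targets, feat_lens, target_lens):
        # feats: (batch, time, n_feats) mixture features.
        angle_logits = self.doa_net(feats.mean(dim=1))          # (B, n_angles)
        angle_probs = angle_logits.softmax(dim=-1)              # soft DOA estimate
        cond = self.angle_embed(angle_probs).unsqueeze(1)       # (B, 1, hidden)
        cond = cond.expand(-1, feats.size(1), -1)
        sep_out, _ = self.separator(torch.cat([feats, cond], dim=-1))
        separated = feats * torch.sigmoid(self.mask(sep_out))   # masked stream
        enc_out, _ = self.encoder(separated)
        log_probs = self.ctc_head(enc_out).log_softmax(-1)      # (B, T, n_tokens)
        # Only the ASR loss drives training; the DOA stays a latent variable.
        loss = self.ctc(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
        return loss, angle_probs.argmax(dim=-1)                 # predicted azimuth bin
```

A two-source version along the lines of the abstract would predict one angle per speaker and decode each separated stream; the sketch only shows how gradients from the recognition objective can reach the DOA estimator.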
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that strengthens the representation of each modality by fusing them at different levels of the audio/visual encoders (a minimal fusion sketch follows after this list).
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios [36.50731790624643]
We introduce RIR-SF, a novel spatial feature based on the room impulse response (RIR).
RIR-SF significantly outperforms traditional 3D spatial features, showing superior theoretical and empirical performance.
We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative 21.3% reduction in CER for target speaker ASR in multi-channel settings.
arXiv Detail & Related papers (2023-10-31T20:42:08Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose an approach to joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm.
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative average SD-CER reduction.
arXiv Detail & Related papers (2022-03-31T06:39:14Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
- Multi-Channel Multi-Speaker ASR Using 3D Spatial Feature [35.280174671205046]
We address the multi-channel multi-speaker ASR challenge by utilizing the location information of target speakers in 3D space for the first time.
Two paradigms are investigated: 1) a pipelined system with a multi-channel speech separation module followed by the state-of-the-art single-channel ASR module; 2) an "All-In-One" model where the 3D spatial feature is directly used as an input to the ASR system without explicit separation modules.
Experimental results show that 1) the proposed All-In-One model achieved a comparable error rate to the pipelined system while reducing the inference time by half; 2) the proposed 3
arXiv Detail & Related papers (2021-11-22T07:19:12Z)
- Discriminative Multi-modality Speech Recognition [17.296404414250553]
Vision is often used as a complementary modality for audio speech recognition (ASR).
In this paper, we propose a two-stage speech recognition model.
In the first stage, the target voice is separated from background noises with help from the corresponding visual information of lip movements, making the model 'listen' clearly.
In the second stage, the audio modality is combined with the visual modality again by a multi-modality speech recognition (MSR) sub-network to better understand the speech, further improving the recognition rate.
arXiv Detail & Related papers (2020-05-12T07:56:03Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly (by 14% relative) with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
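As a rough illustration of the multi-layer cross-attention fusion mentioned in the MLCA-AVSR entry above, the sketch below fuses audio and visual encoder states at a single layer with standard multi-head cross-attention; a multi-layer variant would insert such a block at several encoder depths. Class and argument names are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical single-layer audio-visual cross-attention fusion, in the
# spirit of the MLCA-AVSR entry above: each modality attends to the other
# and the result is added back residually.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Audio queries attend to visual keys/values, and vice versa.
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio: (B, Ta, dim), visual: (B, Tv, dim)
        a_ctx, _ = self.a2v(audio, visual, visual)   # audio enriched by video
        v_ctx, _ = self.v2a(visual, audio, audio)    # video enriched by audio
        return self.norm_a(audio + a_ctx), self.norm_v(visual + v_ctx)


# Usage: fused_audio, fused_visual = CrossModalFusion()(audio_feats, visual_feats)
```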
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.