Self-Attention Channel Combinator Frontend for End-to-End Multichannel
Far-field Speech Recognition
- URL: http://arxiv.org/abs/2109.04783v1
- Date: Fri, 10 Sep 2021 11:03:43 GMT
- Title: Self-Attention Channel Combinator Frontend for End-to-End Multichannel
Far-field Speech Recognition
- Authors: Rong Gong, Carl Quillen, Dushyant Sharma, Andrew Goderre, José Laínez, Ljubomir Milanović
- Abstract summary: When sufficiently large far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results.
Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated into an E2E ASR system with learnable parameters.
We propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain.
- Score: 1.0276024900942875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When sufficiently large far-field training data is available, jointly
optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech
Recognition (ASR) backend shows promising results. Recent literature has shown
that traditional beamformer designs, such as MVDR (Minimum Variance
Distortionless Response) or fixed beamformers, can be successfully integrated
as the frontend into an E2E ASR system with learnable parameters. In this work, we propose the
self-attention channel combinator (SACC) ASR frontend, which leverages the
self-attention mechanism to combine multichannel audio signals in the magnitude
spectral domain. Experiments conducted on multichannel playback test data
show that the SACC achieved a 9.3% WERR (word error rate reduction) compared to a state-of-the-art fixed
beamformer-based frontend, both jointly optimized with a ContextNet-based ASR
backend. We also demonstrate the connection between the SACC and the
traditional beamformers, and analyze the intermediate outputs of the SACC.
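
The abstract describes the SACC frontend only at a high level: self-attention across microphone channels, applied to magnitude spectra, yields per-frame channel combination weights. Below is a minimal NumPy sketch of that idea; the projection sizes, the untrained random projection matrices, and the way attention scores are collapsed into one weight per channel are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of a SACC-style frontend: combine C microphone channels in the
# magnitude spectral domain using per-frame attention weights over channels.
# Shapes and projections are assumptions, not the published architecture.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sacc_combine(mag, d_k=64, rng=None):
    """mag: (C, T, F) magnitude spectra -> (T, F) combined spectrum."""
    rng = np.random.default_rng(0) if rng is None else rng
    C, T, F = mag.shape
    # Hypothetical learnable query/key projections (random here; in the actual
    # system they would be trained jointly with the ASR backend).
    Wq = rng.standard_normal((F, d_k)) / np.sqrt(F)
    Wk = rng.standard_normal((F, d_k)) / np.sqrt(F)

    frames = mag.transpose(1, 0, 2)                    # (T, C, F): channels per frame
    Q = frames @ Wq                                    # (T, C, d_k)
    K = frames @ Wk                                    # (T, C, d_k)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (T, C, C) channel-to-channel
    attn = softmax(scores, axis=-1)                    # attention over channels
    # Assumption: average the attention each channel receives to get a single
    # combination weight per channel and frame (rows already sum to 1).
    w = attn.mean(axis=1)                              # (T, C)
    return np.einsum('tc,tcf->tf', w, frames)          # weighted sum over channels

# Toy usage: 4 channels, 100 frames, 257 frequency bins.
mags = np.abs(np.random.default_rng(1).standard_normal((4, 100, 257)))
print(sacc_combine(mags).shape)  # (100, 257)
```

The combined magnitude spectrogram would then be fed to the ASR backend (ContextNet in the paper), so that the channel combination weights can be learned end-to-end from the recognition loss.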
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder
and Input Feature Analysis [0.0]
We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame crosschannel attention and a speaker-attributed Transformer-based decoder.
arXiv Detail & Related papers (2023-10-16T06:40:18Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Hybrid Knowledge-Data Driven Channel Semantic Acquisition and
Beamforming for Cell-Free Massive MIMO [6.010360758759109]
This paper focuses on advancing outdoor wireless systems to better support ubiquitous extended reality (XR) applications.
We propose a hybrid knowledge-data driven method for channel semantic acquisition and multi-user beamforming in cell-free massive multiple-input multiple-output (MIMO) systems.
arXiv Detail & Related papers (2023-07-06T15:35:55Z) - Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation
and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in mask-based MVDR speech separation, and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z) - End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose an approach for joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm.
arXiv Detail & Related papers (2022-04-01T05:45:33Z) - Spatial Processing Front-End For Distant ASR Exploiting Self-Attention
Channel Combinator [11.248169478873344]
We present a novel multi-channel front-end based on channel shortening with the Weighted Prediction Error (WPE) method.
We show that the proposed system used as part of a ContextNet based end-to-end (E2E) ASR system outperforms leading ASR systems.
arXiv Detail & Related papers (2022-03-25T21:43:15Z) - The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge.
We adopted the serialized output training (SOT) based multi-speakers ASR system with large-scale simulation data.
Our system achieved a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.
arXiv Detail & Related papers (2022-02-03T14:38:26Z) - Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z) - Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% absolute (26.83% relative) and 22.22% absolute (56.87% relative) word error rate (WER) reduction on overlapped speech constructed using either simulation or replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively (see the short absolute-versus-relative WER calculation after this list).
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
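
The last entry quotes both absolute and relative WER reductions; the two are related by relative = absolute / baseline. The back-calculation below, in the same Python used above, is purely illustrative: the implied baseline WERs are derived from the quoted figures and are not stated in that abstract.

```python
# Relate absolute and relative WER reduction for the quoted 6.81% / 26.83% figures.
abs_red, rel_red = 6.81, 26.83           # percentage points, percent
baseline = abs_red / (rel_red / 100.0)   # implied baseline WER
system = baseline - abs_red              # implied system WER
print(f"baseline ~ {baseline:.2f}%, system ~ {system:.2f}%")  # ~25.38%, ~18.57%
```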