Multi-head Monotonic Chunkwise Attention For Online Speech Recognition
- URL: http://arxiv.org/abs/2005.00205v1
- Date: Fri, 1 May 2020 04:00:51 GMT
- Title: Multi-head Monotonic Chunkwise Attention For Online Speech Recognition
- Authors: Baiji Liu and Songjun Cao and Sining Sun and Weibin Zhang and Long Ma
- Abstract summary: We propose multi-head monotonic chunk-wise attention (MTH-MoChA), an improved version of MoChA.
MTH-MoChA splits the input sequence into small chunks and computes multi-head attentions over the chunks.
Experiments on AISHELL-1 data show that the proposed model, along with the training strategies, improves the character error rate (CER) of MoChA from 8.96% to 7.68% on the test set.
- Score: 12.619595173472465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The attention mechanism of the Listen, Attend and Spell (LAS) model requires
the whole input sequence to calculate the attention context and thus is not
suitable for online speech recognition. To deal with this problem, we propose
multi-head monotonic chunk-wise attention (MTH-MoChA), an improved version of
MoChA. MTH-MoChA splits the input sequence into small chunks and computes
multi-head attentions over the chunks. We also explore useful training
strategies such as LSTM pooling, minimum word error rate training, and
SpecAugment to further improve the performance of MTH-MoChA. Experiments on
AISHELL-1 data show that the proposed model, together with these training
strategies, improves the character error rate (CER) of MoChA from 8.96% to 7.68%
on the test set. On another 18,000-hour in-car speech dataset, MTH-MoChA obtains
7.28% CER, which is significantly better than a state-of-the-art hybrid system.
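To make the chunkwise multi-head idea concrete, below is a minimal inference-time sketch in PyTorch: it assumes each head has already made its hard monotonic decision (a boundary frame index) and then attends softly over the chunk of frames ending at that boundary, concatenating the per-head contexts. All names (chunk_size, boundaries, etc.) are illustrative rather than taken from the paper, and MoChA's training-time expected-alignment computation is omitted.

```python
# A minimal inference-time sketch of multi-head chunkwise attention, assuming the
# monotonic boundary for each head has already been selected. Names and shapes
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def chunkwise_multihead_attention(queries, keys, values, boundaries, chunk_size=4):
    """queries: (H, D); keys/values: (T, H, D); boundaries: (H,) int frame indices.

    For each head h, attends softly over the chunk of at most `chunk_size`
    encoder frames that ends at the monotonically chosen boundary boundaries[h].
    """
    H, D = queries.shape
    contexts = []
    for h in range(H):
        end = int(boundaries[h]) + 1          # chunk ends at the chosen frame
        start = max(0, end - chunk_size)      # chunk of at most `chunk_size` frames
        k = keys[start:end, h]                # (w, D)
        v = values[start:end, h]              # (w, D)
        scores = k @ queries[h] / D ** 0.5    # (w,) scaled dot-product scores
        weights = F.softmax(scores, dim=0)
        contexts.append(weights @ v)          # (D,) per-head context
    return torch.cat(contexts)                # concatenated head contexts, (H*D,)

# toy usage
T, H, D = 20, 4, 8
ctx = chunkwise_multihead_attention(torch.randn(H, D), torch.randn(T, H, D),
                                    torch.randn(T, H, D), torch.tensor([5, 7, 7, 9]))
print(ctx.shape)  # torch.Size([32])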
Related papers
- MoH: Multi-Head Attention as Mixture-of-Head Attention [63.67734699877724]
We upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level.
We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts mechanism.
MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters.
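As a rough illustration of the heads-as-experts routing described above (not the authors' implementation), the sketch below lets each token pick its top-k attention heads with a softmax-normalized router and combine only those head outputs; the shapes and the top-k rule are assumptions.

```python
# Hedged sketch of token-wise head routing: each token keeps only its top-k heads,
# weighted by a router. Dimensions and the routing rule are illustrative.
import torch
import torch.nn.functional as F

def route_heads(head_outputs, router_logits, k=2):
    """head_outputs: (T, H, D) per-head outputs; router_logits: (T, H)."""
    T, H, D = head_outputs.shape
    topk_val, topk_idx = router_logits.topk(k, dim=-1)        # (T, k)
    gates = torch.zeros_like(router_logits)                    # (T, H)
    gates.scatter_(-1, topk_idx, F.softmax(topk_val, dim=-1))  # weights on chosen heads
    return (gates.unsqueeze(-1) * head_outputs).sum(dim=1)     # (T, D)

out = route_heads(torch.randn(6, 8, 16), torch.randn(6, 8), k=2)
print(out.shape)  # torch.Size([6, 16])
```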
arXiv Detail & Related papers (2024-10-15T17:59:44Z)
- MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose MC-MoE, a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs).
MC-MoE leverages the significance of both experts and tokens to achieve an extreme compression.
For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss.
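The summary gives few details, but the idea of spending a bit budget according to expert significance can be illustrated with a toy allocation rule; everything below (the greedy scheme, the 1-4 bit choices, the scoring) is an assumption for illustration, not the paper's actual compressor.

```python
# Toy sketch of budget-constrained mixed-precision allocation: experts judged more
# significant get more bits, subject to an average bit budget. Illustrative only.
import numpy as np

def allocate_bits(significance, avg_bits=2.54, choices=(1, 2, 3, 4)):
    """Greedily assign each expert a bit-width so the mean stays under avg_bits."""
    n = len(significance)
    order = np.argsort(-np.asarray(significance))  # most significant experts first
    bits = np.full(n, min(choices), dtype=float)   # start everyone at the minimum
    budget = avg_bits * n - bits.sum()             # leftover bits to distribute
    for i in order:                                # upgrade important experts first
        while bits[i] < max(choices) and budget >= 1.0:
            bits[i] += 1.0
            budget -= 1.0
    return bits

print(allocate_bits([0.9, 0.1, 0.5, 0.2]))  # [4. 1. 4. 1.], mean 2.5 <= 2.54
```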
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms were originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification [20.206229252251717]
We propose Multimodal Attention Merging (MAM), an approach that merges attention matrices across modalities.
MAM reduces the Word Error Rate (WER) of an Automatic Speech Recognition (ASR) model by up to 6.70% relative.
Learnable-MAM, a data-driven approach to merging attention matrices, results in a further 2.90% relative reduction in WER for ASR and an 18.42% relative reduction in Audio Event Classification (AEC) error.
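Since the summary only names the operation, here is a hedged sketch of what merging attention matrices could look like in its simplest form: a convex combination of an audio model's attention map with one from a donor modality, with the mixing weight fixed (MAM-like) or learned from data (Learnable-MAM-like). The weighting scheme and names are assumptions for illustration.

```python
# Hypothetical sketch of attention-matrix merging: interpolate an audio model's
# attention weights with a donor modality's. A fixed `alpha` is MAM-like; making
# it a learnable parameter is Learnable-MAM-like. Shapes/names are illustrative.
import torch

def merge_attention(attn_audio, attn_donor, alpha=0.1):
    """attn_*: (num_heads, T, T) attention weight matrices over the same length T."""
    merged = (1.0 - alpha) * attn_audio + alpha * attn_donor
    return merged / merged.sum(dim=-1, keepdim=True)  # keep rows normalized

# learnable variant: alpha becomes a parameter optimized on a downstream ASR/AEC loss
alpha = torch.nn.Parameter(torch.tensor(0.1))
merged = merge_attention(torch.rand(4, 10, 10), torch.rand(4, 10, 10), alpha)
print(merged.shape)  # torch.Size([4, 10, 10])
```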
arXiv Detail & Related papers (2023-12-22T02:08:40Z)
- Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing [107.031903351176]
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances.
It also aims to identify the corresponding event categories with only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z)
- Combining Spatial Clustering with LSTM Speech Models for Multichannel Speech Enhancement [3.730592618611028]
Recurrent neural networks using the LSTM architecture can achieve significant single-channel noise reduction.
It is not obvious, however, how to apply them to multi-channel inputs in a way that can generalize to new microphone configurations.
This paper combines the two approaches to attain both the spatial separation performance and generality of multichannel spatial clustering.
arXiv Detail & Related papers (2020-12-02T22:37:50Z)
- Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech [17.602098162338137]
We explore a multimodal semi-supervised learning approach for punctuation prediction.
We learn representations from large amounts of unlabelled audio and text data.
When trained on 1 hour of speech and text data, the proposed model achieved a 9-18% absolute improvement over the baseline model.
arXiv Detail & Related papers (2020-08-03T08:13:09Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on the RNN-Transducer together with improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
- Attention based on-device streaming speech recognition with large speech corpus [16.702653972113023]
We present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus.
We attained a word recognition rate of around 90% for the general domain, mainly by using joint training with connectionist temporal classification (CTC) and cross-entropy (CE) losses.
For on-demand adaptation, we fused the MoChA models with statistical n-gram models and achieved a relative 36% improvement in word error rate (WER) on average for target domains, including the general domain.
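As a rough illustration of the n-gram fusion mentioned above (the exact interpolation used in that paper is not given in this summary), a shallow-fusion style score combines the end-to-end model's log-probability with a weighted external LM log-probability during beam search; the weight and function names below are assumptions.

```python
# Toy sketch of shallow-fusion style scoring with an external n-gram LM.
# The LM weight and names are illustrative assumptions.
import math

def fused_score(am_logprob, lm_prob, lm_weight=0.3):
    """Combine the end-to-end model's log-probability with a weighted LM log-probability."""
    return am_logprob + lm_weight * math.log(max(lm_prob, 1e-12))

# toy usage: a domain word favoured by the n-gram LM scores higher than one it disfavours
print(fused_score(-1.2, 0.4))   # ~ -1.47
print(fused_score(-1.2, 0.01))  # ~ -2.58
```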
arXiv Detail & Related papers (2020-01-02T04:24:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.