Improving Mandarin End-to-End Speech Recognition with Word N-gram
Language Model
- URL: http://arxiv.org/abs/2201.01995v1
- Date: Thu, 6 Jan 2022 10:04:56 GMT
- Title: Improving Mandarin End-to-End Speech Recognition with Word N-gram
Language Model
- Authors: Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, and Dong Yu
- Abstract summary: External language models (LMs) are used to improve the recognition performance of end-to-end (E2E) automatic speech recognition (ASR) systems.
We propose a novel decoding algorithm where a word-level lattice is constructed on-the-fly to consider all possible word sequences.
Our method consistently outperforms subword-level LMs, including N-gram LM and neural network LM.
- Score: 57.92200214957124
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite the rapid progress of end-to-end (E2E) automatic speech recognition
(ASR), it has been shown that incorporating external language models (LMs) into
the decoding can further improve the recognition performance of E2E ASR
systems. To align with the modeling units adopted in E2E ASR systems,
subword-level (e.g., characters, BPE) LMs are usually used to cooperate with
current E2E ASR systems. However, the use of subword-level LMs will ignore the
word-level information, which may limit the strength of the external LMs in E2E
ASR. Although several methods have been proposed to incorporate word-level
external LMs in E2E ASR, these methods are mainly designed for languages with
clear word boundaries such as English and cannot be directly applied to
languages like Mandarin, in which each character sequence can have multiple
corresponding word sequences. To this end, we propose a novel decoding
algorithm where a word-level lattice is constructed on-the-fly to consider all
possible word sequences for each partial hypothesis. Then, the LM score of the
hypothesis is obtained by intersecting the generated lattice with an external
word N-gram LM. The proposed method is examined on both Attention-based
Encoder-Decoder (AED) and Neural Transducer (NT) frameworks. Experiments
suggest that our method consistently outperforms subword-level LMs, including
N-gram LM and neural network LM. We achieve state-of-the-art results on both
Aishell-1 (CER 4.18%) and Aishell-2 (CER 5.06%) datasets and reduce CER by
14.8% relatively on a 21K-hour Mandarin dataset.
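The abstract describes the decoding algorithm only at a high level. Below is a minimal sketch, assuming a toy lexicon and a toy bigram LM, of how a partial character hypothesis can be segmented into in-lexicon word sequences (the word lattice, represented here by a dynamic-programming table) and scored with a word N-gram LM. The lexicon, the probabilities, and the `score_hypothesis` helper are invented for illustration; the paper's method builds the lattice incrementally during AED/NT beam search and intersects it with a full external word N-gram LM, rather than keeping only the single best path as done here.
```python
# Minimal sketch (not the authors' implementation): segment a partial Mandarin
# character hypothesis into in-lexicon word sequences and score the best
# segmentation with a toy bigram LM. The real method keeps the full lattice
# and intersects it with an external word N-gram LM during beam search.
import math
from collections import defaultdict

LEXICON = {"北京", "北", "京", "天气", "天", "气", "很", "好"}   # toy word list
MAX_WORD_LEN = max(len(w) for w in LEXICON)

UNIGRAM = defaultdict(lambda: math.log(1e-4))                   # toy back-off score
BIGRAM = {("<s>", "北京"): math.log(0.2), ("北京", "天气"): math.log(0.3),
          ("天气", "很"): math.log(0.4), ("很", "好"): math.log(0.5)}

def lm_score(prev, word):
    """Toy bigram score with unigram back-off."""
    return BIGRAM.get((prev, word), UNIGRAM[word])

def score_hypothesis(chars):
    """Best word-segmentation score of a character string (Viterbi over the lattice).

    best[i] maps the last word of a segmentation covering chars[:i] to its score,
    which is the per-state information a word lattice carries.
    """
    best = [defaultdict(lambda: -math.inf) for _ in range(len(chars) + 1)]
    best[0]["<s>"] = 0.0
    for i in range(len(chars)):
        for prev, score in list(best[i].items()):
            for j in range(i + 1, min(i + MAX_WORD_LEN, len(chars)) + 1):
                word = chars[i:j]
                if word in LEXICON:
                    best[j][word] = max(best[j][word], score + lm_score(prev, word))
    return max(best[len(chars)].values(), default=-math.inf)

print(score_hypothesis("北京天气很好"))   # best path: 北京 / 天气 / 很 / 好
```
In a real decoder this word-level LM score would presumably be combined with the E2E model's own score for each partial hypothesis during beam search, in the spirit of shallow fusion.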
Related papers
- Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 [61.189875635090225]
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST).
arXiv Detail & Related papers (2024-06-24T16:38:17Z)
- Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Neighbor Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources.
NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks.
In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
- Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition [23.172469312225694]
We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR).
The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder.
Experimental results show that the proposed LLM-guided model reduces word error rates by approximately 13% relative across major benchmarks.
arXiv Detail & Related papers (2023-09-19T11:10:50Z)
- Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition [33.07184218085399]
We propose two zero-shot ASR domain adaptation methods that use LLaMA and require only a domain-specific text prompt.
Experiments show that, with only one domain prompt, both methods can effectively reduce word error rates (WER) on out-of-domain TedLium-2 and SPGI datasets.
arXiv Detail & Related papers (2023-06-28T08:29:00Z)
- Joint Prompt Optimization of Stacked LLMs using Variational Inference [66.04409787899583]
Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences.
By stacking two such layers and feeding the output of one layer to the next, we obtain a two-layer Deep Language Network (DLN-2).
We show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4.
arXiv Detail & Related papers (2023-06-21T18:45:56Z)
- Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer [17.700515986659063]
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
This paper proposes a new method for creating code-switching ASR datasets from purely monolingual data sources.
A novel Concatenated Tokenizer enables ASR models to generate language ID for each emitted text token while reusing existing monolingual tokenizers (a generic sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-06-14T21:24:11Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Spiral Language Modeling [5.816641790933646]
Spiral Language Modeling (SLM) is a general approach that enables one to construct natural language sentences beyond the left-to-right (L2R) and right-to-left (R2L) order.
SLM allows one to form natural language text by starting from an arbitrary token inside the result text.
Experiments on 8 widely studied Neural Machine Translation (NMT) tasks show that SLM is consistently effective, with up to a 4.7 BLEU increase.
arXiv Detail & Related papers (2021-12-20T14:08:38Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
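As a generic illustration of the concatenated-tokenizer idea mentioned in the code-switching entry above: two monolingual vocabularies can be merged by offsetting the second one's token IDs, so that every emitted ID also identifies its language. The `ConcatTokenizer` class and the toy English/Spanish word lists below are invented for illustration (and assume disjoint vocabularies); the paper reuses existing monolingual subword tokenizers, and its exact construction may differ.
```python
# Generic sketch (not the paper's implementation): merge two monolingual
# vocabularies by offsetting the second one's IDs, so that each emitted token ID
# also implies a language ID. Assumes the two vocabularies are disjoint.
class ConcatTokenizer:
    def __init__(self, vocab_a, vocab_b, lang_a="en", lang_b="es"):
        self.tok2id = {t: i for i, t in enumerate(vocab_a)}
        offset = len(vocab_a)
        self.tok2id.update({t: offset + i for i, t in enumerate(vocab_b)})
        self.id2tok = {i: t for t, i in self.tok2id.items()}
        self.offset, self.lang_a, self.lang_b = offset, lang_a, lang_b

    def encode(self, tokens):
        return [self.tok2id[t] for t in tokens]

    def decode_with_lang(self, ids):
        # IDs below the offset came from vocab A, the rest from vocab B.
        return [(self.id2tok[i], self.lang_a if i < self.offset else self.lang_b)
                for i in ids]

tok = ConcatTokenizer(["hello", "how", "are", "you"], ["hola", "qué", "tal"])
print(tok.decode_with_lang(tok.encode(["hello", "qué", "tal"])))
# [('hello', 'en'), ('qué', 'es'), ('tal', 'es')]
```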