An efficient text augmentation approach for contextualized Mandarin speech recognition
- URL: http://arxiv.org/abs/2406.09950v1
- Date: Fri, 14 Jun 2024 11:53:14 GMT
- Title: An efficient text augmentation approach for contextualized Mandarin speech recognition
- Authors: Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Zhou Huan
- Abstract summary: Our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models.
To contextualize a pre-trained CIF-based ASR, we construct a codebook using limited speech-text data.
Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance.
- Score: 4.600045052545344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. In particular, to contextualize a pre-trained CIF-based ASR, we construct a codebook using limited speech-text data. By utilizing a simple codebook lookup process, we convert available text-only data into latent text embeddings. These embeddings then enhance the inputs for the contextualized ASR. Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance. The top-performing system shows relative CER improvements of up to 30% on rare words and 15% across all words in general.
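The codebook lookup described in the abstract can be pictured with a minimal sketch. The abstract does not say how the codebook is built, so this sketch assumes one mean-pooled embedding per token, computed from token-aligned embeddings such as those a CIF-based model emits. The function names, the zero-vector fallback for unseen tokens, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def build_codebook(
    paired: Sequence[Tuple[Sequence[str], Sequence[Vector]]],
) -> Dict[str, Vector]:
    """Average the token-level embeddings observed for each token.

    `paired` holds (tokens, embeddings) pairs with one embedding per token.
    Mean pooling is an assumption of this sketch, not the paper's stated method.
    """
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = defaultdict(int)
    for tokens, embeddings in paired:
        if len(tokens) != len(embeddings):
            raise ValueError("expected one embedding per token")
        for token, vec in zip(tokens, embeddings):
            if token not in sums:
                sums[token] = [0.0] * len(vec)
            sums[token] = [s + v for s, v in zip(sums[token], vec)]
            counts[token] += 1
    return {t: [s / counts[t] for s in total] for t, total in sums.items()}


def text_to_embeddings(
    text: Sequence[str], codebook: Dict[str, Vector], dim: int
) -> List[Vector]:
    """Look up each token; unseen tokens get a zero vector (assumption)."""
    return [list(codebook.get(tok, [0.0] * dim)) for tok in text]


# Toy "limited speech-text data": two utterances with 2-dimensional embeddings.
paired_data = [
    (["你", "好"], [[1.0, 0.0], [0.0, 2.0]]),
    (["你", "们"], [[3.0, 0.0], [1.0, 1.0]]),
]
codebook = build_codebook(paired_data)

# Text-only data is converted to latent embeddings by lookup alone.
latent = text_to_embeddings(["你", "们", "好", "吗"], codebook, dim=2)
print(latent)
```

In this sketch no audio is needed at lookup time, which is the property that lets large text-only corpora supply extra inputs for a contextualized ASR model at low computational cost.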
Related papers
- Grammar Induction from Visual, Speech and Text [91.98797120799227]
This work introduces a novel visual-audio-text grammar induction task (VAT-GI).
Inspired by the fact that language grammar exists beyond text, we argue that text need not be the predominant modality in grammar induction.
We propose a visual-audio-text inside-outside autoencoder (VaTiora) framework, which leverages rich modality-specific and complementary features for effective grammar parsing.
arXiv Detail & Related papers (2024-10-01T02:24:18Z)
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Pronunciation Assessment with Multi-modal Large Language Models [10.35401596425946]
We propose a scoring system based on large language models (LLMs).
The speech encoder first maps the learner's speech into contextual features.
The adapter layer then transforms these features to align with the text embedding in latent space.
arXiv Detail & Related papers (2024-07-12T12:16:14Z)
- Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques [17.166092544686553]
This study benchmarks Speech Emotion Recognition using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora.
We propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript.
arXiv Detail & Related papers (2024-06-12T15:59:25Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- End-to-End Speech Recognition Contextualization with Large Language Models [25.198480789044346]
We introduce a novel method for contextualizing speech recognition models that incorporates Large Language Models (LLMs).
We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion.
Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided.
arXiv Detail & Related papers (2023-09-19T20:28:57Z)
- Boosting Punctuation Restoration with Data Generation and Reinforcement Learning [70.26450819702728]
Punctuation restoration is an important task in automatic speech recognition (ASR).
The discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts.
This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap.
arXiv Detail & Related papers (2023-07-24T17:22:04Z)
- Looking and Listening: Audio Guided Text Recognition [62.98768236858089]
Text recognition in the wild is a long-standing problem in computer vision.
Recent studies suggest vision and language processing are effective for scene text recognition.
Yet, solving edit errors such as add, delete, or replace is still the main challenge for existing approaches.
We propose AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction.
arXiv Detail & Related papers (2023-06-06T08:08:18Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
- Implicit Feature Alignment: Learn to Convert Text Recognizer to Text Spotter [38.4211220941874]
We propose a simple, elegant and effective paradigm called Implicit Feature Alignment (IFA).
IFA can be easily integrated into current text recognizers, resulting in a novel inference mechanism called IFA-inference.
We experimentally demonstrate that IFA achieves state-of-the-art performance on end-to-end document recognition tasks.
arXiv Detail & Related papers (2021-06-10T17:06:28Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)