Exploring Retraining-Free Speech Recognition for Intra-sentential
Code-Switching
- URL: http://arxiv.org/abs/2109.00921v1
- Date: Fri, 27 Aug 2021 19:15:16 GMT
- Title: Exploring Retraining-Free Speech Recognition for Intra-sentential
Code-Switching
- Authors: Zhen Huang, Xiaodan Zhuang, Daben Liu, Xiaoqiang Xiao, Yuchen Zhang,
Sabato Marco Siniscalchi
- Abstract summary: We present our initial efforts for building a code-switching (CS) speech recognition system.
We have designed an automatic approach to obtain high quality pronunciation of foreign language words.
Our best system achieves a 55.5% relative word error rate reduction from 34.4%, obtained with a conventional monolingual ASR system.
- Score: 17.973043287866986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present our initial efforts for building a code-switching
(CS) speech recognition system leveraging existing acoustic models (AMs) and
language models (LMs), i.e., no training required, and specifically targeting
intra-sentential switching. To achieve such an ambitious goal, new mechanisms
for foreign pronunciation generation and language model (LM) enrichment have
been devised. Specifically, we have designed an automatic approach to obtain
high quality pronunciation of foreign language (FL) words in the native
language (NL) phoneme set using existing acoustic phone decoders and an
LSTM-based grapheme-to-phoneme (G2P) model. Improved accented pronunciations
have thus been obtained by learning foreign pronunciations directly from data.
Furthermore, a code-switching LM was deployed by converting the original NL LM
into a CS LM using translated word pairs and borrowing statistics for the NL
LM. Experimental evidence clearly demonstrates that our approach better deals
with accented foreign pronunciations than techniques based on human labeling.
Moreover, our best system achieves a 55.5% relative word error rate reduction
from 34.4%, obtained with a conventional monolingual ASR system, to 15.3% on an
intra-sentential CS task without harming the monolingual recognition accuracy.
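As a quick sanity check on the headline numbers, the 55.5% figure follows from the standard definition of relative word error rate reduction. The snippet below only reproduces that arithmetic with the WER values quoted in the abstract; it is not code from the paper.

```python
# Sanity check of the reported relative word error rate reduction (WERR),
# using the WER figures quoted in the abstract. This is the standard
# relative-reduction formula, not the paper's code.

def relative_werr(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, expressed as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer

baseline = 34.4  # % WER, conventional monolingual ASR system
proposed = 15.3  # % WER, retraining-free code-switching system
print(f"{relative_werr(baseline, proposed):.1%}")  # prints 55.5%
```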
Related papers
- TIPAA-SSL: Text Independent Phone-to-Audio Alignment based on Self-Supervised Learning and Knowledge Transfer [3.9981390090442694]
We present a novel approach for text-independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer.
We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English.
Our proposed model outperforms the state-of-the-art (charsiu) in statistical metrics and has applications in language learning and speech processing systems.
arXiv Detail & Related papers (2024-05-03T14:25:21Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
- Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced Code-Switching Speech Recognition [5.3545957730615905]
We introduce language identification information into the middle layer of the ASR model's encoder.
We aim to generate acoustic features that encode language distinctions more implicitly, reducing the model's confusion when dealing with language switching.
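Injecting language-ID supervision at an intermediate encoder layer is typically realized as an auxiliary loss interpolated with the main CTC objective. A minimal sketch of such a weighted combination is below; the interpolation weight `alpha`, the function name, and the plain floats standing in for framework loss tensors are illustrative assumptions, not that paper's implementation.

```python
# Illustrative sketch: interpolate the main CTC loss with an auxiliary loss
# computed on an intermediate encoder layer against language-ID targets.
# `alpha` and the plain-float "losses" are assumptions for illustration only.

def combined_loss(main_ctc: float, intermediate_lid: float, alpha: float = 0.3) -> float:
    """Weighted combination of the main CTC loss and the intermediate LID loss."""
    return (1.0 - alpha) * main_ctc + alpha * intermediate_lid

# e.g. main CTC loss 2.0, intermediate LID loss 1.0:
print(round(combined_loss(2.0, 1.0), 3))  # 0.7 * 2.0 + 0.3 * 1.0 = 1.7
```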
arXiv Detail & Related papers (2023-12-15T07:46:35Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
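The splicing idea can be illustrated with a toy example: a code-switched transcript is rendered by concatenating word-aligned audio segments drawn from monolingual inventories. Everything below, including the segment tables and the integer lists standing in for audio samples, is a hypothetical stand-in, not that paper's pipeline.

```python
# Toy sketch of the Speech Collage idea: synthesize a code-switched
# utterance by splicing word-aligned audio segments from monolingual
# corpora. Segment tables and integer "samples" are hypothetical.

def collage(cs_transcript, *segment_tables):
    """Concatenate per-word audio segments following a CS transcript."""
    audio = []
    for word in cs_transcript:
        for table in segment_tables:
            if word in table:
                audio.extend(table[word])
                break
        else:
            raise KeyError(f"no audio segment found for {word!r}")
    return audio

english = {"play": [10, 11], "please": [12]}
mandarin = {"这首歌": [20, 21, 22]}  # "this song"
print(collage(["play", "这首歌", "please"], english, mandarin))
# [10, 11, 20, 21, 22, 12]
```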
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Pronunciation Generation for Foreign Language Words in Intra-Sentential Code-Switching Speech Recognition [14.024346215923972]
Code-Switching refers to the phenomenon of switching languages within a sentence or discourse.
In this paper, we make use of limited code-switching data as driving materials and explore a shortcut to quickly develop intra-sentential code-switching recognition capability.
arXiv Detail & Related papers (2022-10-26T13:19:35Z)
- Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model [57.92200214957124]
External language models (LMs) are used to improve the recognition performance of end-to-end (E2E) automatic speech recognition (ASR) systems.
We propose a novel decoding algorithm where a word-level lattice is constructed on-the-fly to consider all possible word sequences.
Our method consistently outperforms subword-level LMs, including N-gram LM and neural network LM.
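The core of word-level LM scoring can be shown with a toy bigram model: each word sequence in the lattice is scored by the chain of word-conditional probabilities. The bigram table below is invented for illustration; the on-the-fly lattice construction of that paper is not reproduced here.

```python
# Toy illustration of word-level n-gram LM scoring: a hypothesis is
# scored as the sum of log conditional bigram probabilities. The bigram
# probabilities below are invented for illustration only.
import math

bigram = {
    ("<s>", "今天"): 0.4,  # P(今天 | <s>)
    ("今天", "天气"): 0.5,  # P(天气 | 今天)
    ("天气", "好"): 0.6,    # P(好 | 天气)
}

def log_score(words, lm, backoff=1e-4):
    """Sum of log bigram probabilities, with a flat backoff for unseen pairs."""
    score, prev = 0.0, "<s>"
    for w in words:
        score += math.log(lm.get((prev, w), backoff))
        prev = w
    return score

print(round(log_score(["今天", "天气", "好"], bigram), 3))  # -2.12
```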
arXiv Detail & Related papers (2022-01-06T10:04:56Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well when trained on IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and imposing too strong a model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)
- Streaming Language Identification using Combination of Acoustic Representations and ASR Hypotheses [13.976935216584298]
A common approach to solve multilingual speech recognition is to run multiple monolingual ASR systems in parallel.
We propose an approach that learns and combines acoustic level representations with embeddings estimated on ASR hypotheses.
To reduce the processing cost and latency, we exploit a streaming architecture to identify the spoken language early.
arXiv Detail & Related papers (2020-06-01T04:08:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.