Gujarati-English Code-Switching Speech Recognition using ensemble
prediction of spoken language
- URL: http://arxiv.org/abs/2403.08011v1
- Date: Tue, 12 Mar 2024 18:21:20 GMT
- Title: Gujarati-English Code-Switching Speech Recognition using ensemble
prediction of spoken language
- Authors: Yash Sharma, Basil Abraham, Preethi Jyothi
- Abstract summary: We propose two methods of introducing language-specific parameters and explainability into the multi-head attention mechanism.
Despite being unable to reduce WER significantly, our method shows promise in predicting the correct language from just spoken data.
- Score: 29.058108207186816
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: An important and difficult task in code-switched speech recognition is to
recognize the language, as many words in the two languages can sound similar,
especially in some accents. We focus on improving the performance of end-to-end
Automatic Speech Recognition models by conditioning transformer layers on the
language ID of words and characters in the output, in a per-layer supervised
manner. To this end, we propose two methods of introducing language-specific
parameters and explainability into the multi-head attention mechanism, and
implement a Temporal Loss that helps maintain continuity in input alignment.
Despite being unable to reduce WER significantly, our method shows promise in
predicting the correct language from just spoken data. We introduce
regularization in the language prediction by dropping LID in the sequence,
which helps align long repeated output sequences.
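The LID-dropout regularization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the target sequence interleaves language-ID tags (here the hypothetical tags `<gu>` and `<en>`) with output characters, and randomly removes a fraction of those tags so the model cannot rely on long repeated LID runs.

```python
import random

def drop_lid_tokens(labels, lid_tags=("<gu>", "<en>"), p_drop=0.3, seed=None):
    """Randomly drop language-ID tags from a target label sequence.

    Removing some LID tags regularizes the language supervision and
    discourages the decoder from latching onto long repeated LID runs,
    which can help alignment of long output sequences.
    """
    rng = random.Random(seed)
    kept = []
    for tok in labels:
        if tok in lid_tags and rng.random() < p_drop:
            continue  # drop this LID tag, keep all character tokens
        kept.append(tok)
    return kept

# Example: with p_drop=1.0 every LID tag is removed, characters survive.
labels = ["<gu>", "a", "<en>", "h", "i"]
print(drop_lid_tokens(labels, p_drop=1.0, seed=0))  # → ['a', 'h', 'i']
```

Tag names, drop probability, and the interleaved-sequence layout are all assumptions made for the sketch.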
Related papers
- Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting [45.161909551392085]
We introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner.
Our method is shown to reduce errors significantly, by 28% on average and by 41% on low-resource languages.
arXiv Detail & Related papers (2024-06-18T13:38:58Z) - Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced
Code-Switching Speech Recognition [5.3545957730615905]
We introduce language identification information into the middle layer of the ASR model's encoder.
We aim to generate acoustic features that imply language distinctions in a more implicit way, reducing the model's confusion when dealing with language switching.
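The idea of adding a language-identification loss at an intermediate encoder layer amounts to a weighted combination with the main ASR objective. The sketch below is an assumption about how such a combination is typically formed (the weight and averaging scheme are illustrative, not taken from the paper):

```python
def combined_loss(asr_loss, inter_lid_losses, weight=0.3):
    """Combine the main ASR loss with intermediate-layer LID losses.

    asr_loss:         scalar loss from the final output layer
    inter_lid_losses: list of scalar LID losses, one per tapped
                      intermediate encoder layer
    weight:           interpolation weight for the auxiliary LID term
    """
    inter = sum(inter_lid_losses) / len(inter_lid_losses)  # average over layers
    return (1.0 - weight) * asr_loss + weight * inter

# Example: equal weighting of a 2.0 ASR loss and two LID losses averaging 2.0.
print(combined_loss(2.0, [1.0, 3.0], weight=0.5))  # → 2.0
```

In practice the scalar losses would come from a framework's CTC and cross-entropy modules; plain floats are used here to keep the sketch dependency-free.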
arXiv Detail & Related papers (2023-12-15T07:46:35Z) - MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech
Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR)
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z) - Align, Write, Re-order: Explainable End-to-End Speech Translation via
Operation Sequence Generation [37.48971774827332]
We propose to generate ST tokens out-of-order while remembering how to re-order them later.
We examine two variants of such operation sequences which enable generation of monotonic transcriptions and non-monotonic translations.
arXiv Detail & Related papers (2022-11-11T02:29:28Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech
Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels from different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
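A concatenation-style augmentation like the one summarized above can be sketched as follows. This is an illustrative assumption about the mechanics (audio modeled as sample lists, labels as strings, random utterance choice and ordering), not the paper's actual pipeline:

```python
import random

def make_codeswitch_pair(utts_a, utts_b, seed=None):
    """Build a synthetic code-switched example from two monolingual pools.

    utts_a, utts_b: lists of (audio_samples, label_text) pairs, one pool
                    per source language. Audio is modeled as a list of
                    float samples for simplicity.
    Returns a single (audio, label) pair with an inter-sentential switch.
    """
    rng = random.Random(seed)
    audio_a, label_a = rng.choice(utts_a)
    audio_b, label_b = rng.choice(utts_b)
    if rng.random() < 0.5:  # randomize which language comes first
        audio_a, label_a, audio_b, label_b = audio_b, label_b, audio_a, label_a
    return audio_a + audio_b, label_a + " " + label_b
```

A real implementation would also rescale amplitudes and insert a short silence at the join; those details are omitted here.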
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Code-Switching without Switching: Language Agnostic End-to-End Speech
Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z) - LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages at the frame level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - On the Importance of Word Order Information in Cross-lingual Sequence
Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.