Speech collage: code-switched audio generation by collaging monolingual
corpora
- URL: http://arxiv.org/abs/2309.15674v1
- Date: Wed, 27 Sep 2023 14:17:53 GMT
- Title: Speech collage: code-switched audio generation by collaging monolingual
corpora
- Authors: Amir Hussein, Dorsa Zeinali, Ond\v{r}ej Klejch, Matthew Wiesner, Brian
Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur
- Abstract summary: Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
- Score: 50.356820349870986
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Designing effective automatic speech recognition (ASR) systems for
Code-Switching (CS) often depends on the availability of the transcribed CS
resources. To address data scarcity, this paper introduces Speech Collage, a
method that synthesizes CS data from monolingual corpora by splicing audio
segments. We further improve the smoothness quality of audio generation using
an overlap-add approach. We investigate the impact of generated data on speech
recognition in two scenarios: using in-domain CS text and a zero-shot approach
with synthesized CS text. Empirical results highlight up to 34.4% and 16.2%
relative reductions in Mixed-Error Rate and Word-Error Rate for in-domain and
zero-shot scenarios, respectively. Lastly, we demonstrate that CS augmentation
bolsters the model's code-switching inclination and reduces its monolingual
bias.
Related papers
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - Unified model for code-switching speech recognition and language
identification based on a concatenated tokenizer [17.700515986659063]
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
This paper proposes a new method for creating code-switching ASR datasets from purely monolingual data sources.
A novel Concatenated Tokenizer enables ASR models to generate language ID for each emitted text token while reusing existing monolingual tokenizers.
arXiv Detail & Related papers (2023-06-14T21:24:11Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech
Recognition [62.997667081978825]
Code-Switching (CS) is referred to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are transcribed.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5,03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - Code-Switching Text Augmentation for Multilingual Speech Processing [36.302629721413155]
Code-switching in spoken content has enforced ASR systems to handle mixed input.
Recent ASR studies showed the predominance of E2E-ASR using multilingual data to handle CS phenomena.
We propose a methodology to augment the monolingual data for artificially generating spoken CS text to improve different speech modules.
arXiv Detail & Related papers (2022-01-07T17:14:19Z) - Integrating Knowledge in End-to-End Automatic Speech Recognition for
Mandarin-English Code-Switching [41.88097793717185]
Code-Switching (CS) is a common linguistic phenomenon in multilingual communities.
This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech.
arXiv Detail & Related papers (2021-12-19T17:31:15Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - Mandarin-English Code-switching Speech Recognition with Self-supervised
Speech Representation Models [55.82292352607321]
Code-switching (CS) is common in daily conversations where more than one language is used within a sentence.
This paper uses the recently successful self-supervised learning (SSL) methods to leverage many unlabeled speech data without CS.
arXiv Detail & Related papers (2021-10-07T14:43:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.