On the Impact of Noises in Crowd-Sourced Data for Speech Translation
- URL: http://arxiv.org/abs/2206.13756v1
- Date: Tue, 28 Jun 2022 05:01:06 GMT
- Title: On the Impact of Noises in Crowd-Sourced Data for Speech Translation
- Authors: Siqi Ouyang, Rong Ye, Lei Li
- Abstract summary: We find that MuST-C still suffers from three major quality issues: audio-text misalignment, inaccurate translation, and unnecessary speaker names.
Our experiments show that ST models perform better on clean test sets, and the ranking of models remains consistent across different test sets.
- Score: 11.67083845641806
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Training speech translation (ST) models requires large and high-quality
datasets. MuST-C is one of the most widely used ST benchmark datasets. It
contains around 400 hours of speech-transcript-translation data for each of the
eight translation directions. This dataset passes several quality-control
filters during creation. However, we find that MuST-C still suffers from three
major quality issues: audio-text misalignment, inaccurate translation, and
unnecessary speaker names. What are the impacts of these data quality issues
on model development and evaluation? In this paper, we propose an automatic
method to fix or filter the above quality issues, using English-German (En-De)
translation as an example. Our experiments show that ST models perform better
on clean test sets, and the ranking of the models remains consistent across
different test sets. Moreover, simply removing misaligned data points from the
training set does not lead to a better ST model.
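For intuition, here is a minimal sketch of the kind of automatic fix-or-filter pass the abstract describes. The regex, the characters-per-second thresholds, and the field names are illustrative assumptions, not the authors' actual method:

```python
import re

# Hypothetical heuristics for the three MuST-C issues named above:
# speaker-name prefixes, audio-text misalignment (via a speaking-rate
# check), and filtering. All thresholds and field names are assumptions.

SPEAKER_PREFIX = re.compile(r"^[A-Z][\w .'-]{0,40}:\s+")  # e.g. "Chris Anderson: "

def strip_speaker_name(text: str) -> str:
    """Drop a leading 'Speaker Name:' prefix that is not spoken in the audio."""
    return SPEAKER_PREFIX.sub("", text)

def is_misaligned(audio_sec: float, transcript: str,
                  min_cps: float = 4.0, max_cps: float = 30.0) -> bool:
    """Flag segments whose speaking rate (characters/second) is implausible,
    a cheap proxy for audio-text misalignment."""
    cps = len(transcript) / max(audio_sec, 1e-6)
    return not (min_cps <= cps <= max_cps)

def fix_or_filter(example: dict) -> dict | None:
    """Return a cleaned example, or None to drop it from the dataset."""
    if is_misaligned(example["duration"], example["transcript"]):
        return None
    return {**example,
            "transcript": strip_speaker_name(example["transcript"]),
            "translation": strip_speaker_name(example["translation"])}
```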
Related papers
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train each expert to become an "expert" in speech-to-text, language-to-text, and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z)
- Quantity vs. Quality of Monolingual Source Data in Automatic Text Translation: Can It Be Too Little If It Is Too Good? [2.492943108520374]
This study investigates whether monolingual data can also be too little and whether reducing it based on quality affects the performance of the translation model.
Experiments show that on English-German low-resource NMT, it is often better to select only the most useful additional data, based on quality or closeness to the domain of the test data, than to utilize all of the available data.
arXiv Detail & Related papers (2024-10-17T17:20:40Z)
- Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis [3.16714407449467]
We investigate the role of translation and synthetic data in training language models.
We translate TinyStories, a dataset of 2.2M short stories for 3-4-year-old children, from English to Arabic using the open NLLB-3B MT model.
To rectify issues in the resulting translations, we pre-train the models with a small dataset of synthesized high-quality Arabic stories.
arXiv Detail & Related papers (2024-05-23T07:53:04Z)
- There's no Data Like Better Data: Using QE Metrics for MT Data Filtering [25.17221095970304]
We analyze the viability of using QE metrics to filter out bad-quality sentence pairs from the training data of neural machine translation (NMT) systems.
We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half (see the QE-filtering sketch after this list).
arXiv Detail & Related papers (2023-11-09T13:21:34Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments in both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose the Mix at Three Levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, we mix the training corpus at three levels (word, sentence, and frame; see the frame-level sketch after this list) and fine-tune the entire model on the mixed data.
Experiments and analysis on the MuST-C speech translation benchmark show that M3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
arXiv Detail & Related papers (2022-12-07T14:22:00Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques [12.968557512440759]
Several techniques have been proposed for zero-shot translation.
We investigate whether these ideas can be applied to speech translation, by building ST models trained on speech transcription and text translation data.
The techniques were successfully applied to few-shot ST using limited ST data, with improvements of up to +12.9 BLEU points over direct end-to-end ST and +3.1 BLEU points over ST models fine-tuned from an ASR model.
arXiv Detail & Related papers (2022-01-26T20:20:59Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and the endangered language Cherokee.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Zero-shot Speech Translation [0.0]
Speech Translation (ST) is the task of translating speech in one language into text in another language.
End-to-end approaches use only one system to avoid error propagation, yet are difficult to employ due to data scarcity.
We explore zero-shot translation, which enables translating a pair of languages that is unseen during training.
arXiv Detail & Related papers (2021-07-13T12:00:44Z)
- Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z)
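As referenced in the QE-filtering entry above, a rough illustration of quality-estimation-based data filtering. The `qe_score` callable stands in for any reference-free QE model and is a placeholder, not a real API; the 50% keep ratio simply mirrors the halving reported in that abstract:

```python
def filter_by_qe(pairs, qe_score, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of (src, tgt) pairs by QE score.

    `qe_score(src, tgt) -> float` is a placeholder for any reference-free
    quality-estimation model; higher means better. Sorting best-first and
    truncating keeps only the highest-quality part of the corpus.
    """
    scored = sorted(pairs, key=lambda p: qe_score(*p), reverse=True)
    return scored[: int(len(scored) * keep_ratio)]

# Usage sketch: halve the training set while keeping only the best pairs.
# clean_pairs = filter_by_qe(train_pairs, qe_score=my_qe_model, keep_ratio=0.5)
```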
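And a toy sketch of the frame-level mixing referenced in the M3ST entry. The Beta-distributed mixing coefficient and the truncate-to-shorter-length policy are common mixup conventions assumed here, not the authors' exact recipe:

```python
import numpy as np

def mix_frames(feat_a: np.ndarray, feat_b: np.ndarray, alpha: float = 0.2,
               rng: np.random.Generator | None = None) -> np.ndarray:
    """Frame-level mixup of two speech feature sequences of shape (T, D).

    Interpolates the two feature matrices frame by frame with a single
    Beta(alpha, alpha)-sampled coefficient, after truncating both to the
    shorter length so the shapes match (feature dimension D must agree).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    t = min(len(feat_a), len(feat_b))     # align lengths by truncation
    return lam * feat_a[:t] + (1.0 - lam) * feat_b[:t]
```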
This list is automatically generated from the titles and abstracts of the papers on this site.