A Case Study on Filtering for End-to-End Speech Translation
- URL: http://arxiv.org/abs/2402.01945v1
- Date: Fri, 2 Feb 2024 22:42:33 GMT
- Title: A Case Study on Filtering for End-to-End Speech Translation
- Authors: Md Mahfuz Ibn Alam and Antonios Anastasopoulos
- Abstract summary: It is relatively easy to mine a large parallel corpus for any machine learning task, such as speech-to-text or speech-to-speech translation.
This work shows that the simplest filtering technique can trim down these big, noisy datasets to a more manageable, clean dataset.
- Score: 32.676738355929466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is relatively easy to mine a large parallel corpus for any machine
learning task, such as speech-to-text or speech-to-speech translation. Although
these mined corpora are large in volume, their quality is questionable. This
work shows that the simplest filtering technique can trim down these big, noisy
datasets to a more manageable, clean dataset. We also show that using this
clean dataset can improve the model's performance, as in the case of the
multilingual-to-English Speech Translation (ST) model, where, on average, we
obtain a 4.65 BLEU score improvement.
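To make the idea concrete, below is a minimal Python sketch of what such simple filtering can look like for a mined speech translation corpus. The features and thresholds (characters of transcript per second of audio, source/target length ratio) are illustrative assumptions for this summary, not the exact criteria used in the paper.

```python
# A minimal sketch of "simplest filtering" for a mined speech translation
# corpus. The features and thresholds below are illustrative assumptions,
# not the paper's exact criteria.

def keep_example(audio_seconds: float, src: str, tgt: str,
                 max_chars_per_sec: float = 30.0,
                 max_len_ratio: float = 3.0) -> bool:
    """Return True if an (audio, transcript, translation) triple looks clean."""
    # Drop empty or whitespace-only text on either side.
    if not src.strip() or not tgt.strip():
        return False
    # Drop clips whose transcript is implausibly long for the audio duration,
    # a common symptom of audio-text misalignment.
    if audio_seconds <= 0 or len(src) / audio_seconds > max_chars_per_sec:
        return False
    # Drop pairs with an extreme source/target length ratio,
    # a common symptom of mistranslation or truncation.
    ratio = len(src) / max(len(tgt), 1)
    return 1.0 / max_len_ratio <= ratio <= max_len_ratio

# Usage on a toy corpus of {"seconds", "src", "tgt"} records:
corpus = [
    {"seconds": 4.2, "src": "Guten Morgen allerseits.", "tgt": "Good morning, everyone."},
    {"seconds": 0.3, "src": "Ein sehr langer Satz, der nie in den Clip passt.", "tgt": "Noise."},
]
clean = [ex for ex in corpus if keep_example(ex["seconds"], ex["src"], ex["tgt"])]
print(len(clean))  # -> 1: the misaligned second example is filtered out
```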
Related papers
- Setting up the Data Printer with Improved English to Ukrainian Machine Translation [0.0]
We introduce a recipe to build a translation system with a noisy parallel dataset of 3M pairs of Ukrainian and English sentences.
Our decoder-only model, named Dragoman, beats the performance of previous state-of-the-art encoder-decoder models on the FLORES devtest set.
arXiv Detail & Related papers (2024-04-23T16:34:34Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages [0.6947064688250465]
This work describes our approach, which is based on filtering the given noisy data using a sentence-pair classifier; a minimal sketch of this kind of filter appears after this list.
We empirically validate our approach by evaluating on two common datasets and show that data filtering generally improves overall translation quality.
arXiv Detail & Related papers (2022-10-19T16:12:27Z)
- On the Impact of Noises in Crowd-Sourced Data for Speech Translation [11.67083845641806]
We find that MuST-C still suffers from three major quality issues: audio-text misalignment, inaccurate translation, and unnecessary speaker names.
Our experiments show that ST models perform better on clean test sets, and that the ranking of the proposed models remains consistent across different test sets.
arXiv Detail & Related papers (2022-06-28T05:01:06Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve translation performance.
Cross-lingual transfer allows the approach to extend to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
- ESPnet-ST: All-in-One Speech Translation Toolkit [57.76342114226599]
ESPnet-ST is a new project within the end-to-end speech processing toolkit ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z)
- Machine Translation Pre-training for Data-to-Text Generation -- A Case Study in Czech [5.609443065827995]
We study the effectiveness of machine translation-based pre-training for data-to-text generation in non-English languages.
We find that pre-training lets us train end-to-end models with significantly improved performance.
arXiv Detail & Related papers (2020-04-05T02:47:16Z)
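Both the case study above and the "Separating Grains from the Chaff" entry rest on pair-level filtering, so here is the promised sketch of the classifier-style variant: score each source-target pair and keep only pairs that clear a threshold. That paper trains a dedicated sentence-pair classifier; as an illustrative stand-in, this sketch scores pairs with the cosine similarity of multilingual LaBSE sentence embeddings from the sentence-transformers library (an assumption for this summary, not the authors' actual model or threshold).

```python
# Sketch of classifier-style pair filtering: keep (src, tgt) pairs whose
# cross-lingual similarity score clears a threshold. LaBSE embeddings and the
# 0.7 threshold are illustrative stand-ins for the paper's trained classifier.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(pairs, threshold=0.7):
    """Return the subset of (src, tgt) pairs scored above the threshold."""
    src_emb = model.encode([s for s, _ in pairs], convert_to_tensor=True)
    tgt_emb = model.encode([t for _, t in pairs], convert_to_tensor=True)
    # Similarity of each aligned pair = diagonal of the full similarity matrix.
    scores = util.cos_sim(src_emb, tgt_emb).diagonal()
    return [p for p, s in zip(pairs, scores) if s.item() >= threshold]

pairs = [
    ("Hello, world.", "Hallo, Welt."),          # parallel: should be kept
    ("Hello, world.", "Das Wetter ist schoen."),  # mismatched: should be dropped
]
print(filter_pairs(pairs))
```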