GigaST: A 10,000-hour Pseudo Speech Translation Corpus
- URL: http://arxiv.org/abs/2204.03939v2
- Date: Tue, 6 Jun 2023 12:48:48 GMT
- Title: GigaST: A 10,000-hour Pseudo Speech Translation Corpus
- Authors: Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang,
Jun Cao
- Abstract summary: GigaST is a large-scale pseudo speech translation (ST) corpus.
We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese.
The training set is translated by a strong machine translation system and the test set is translated by human.
- Score: 33.572303016012384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces GigaST, a large-scale pseudo speech translation (ST)
corpus. We create the corpus by translating the text in GigaSpeech, an English
ASR corpus, into German and Chinese. The training set is translated by a strong
machine translation system and the test set is translated by human. ST models
trained with an addition of our corpus obtain new state-of-the-art results on
the MuST-C English-German benchmark test set. We provide a detailed description
of the translation process and verify its quality. We make the translated text
data public and hope to facilitate research in speech translation.
Additionally, we also release the training scripts on NeurST to make it easy to
replicate our systems. GigaST dataset is available at
https://st-benchmark.github.io/resources/GigaST.
Related papers
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z) - Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z) - WeTS: A Benchmark for Translation Suggestion [32.10692757420455]
We create a benchmark data set for Translation Suggestion (TS) called emphWeTS.
We also propose several novel methods to generate synthetic corpus which can substantially improve the performance of TS.
Our model achieves State-Of-The-Art (SOTA) results on all four translation directions, including English-to-German, German-to-English, Chinese-to-English and English-to-Chinese.
arXiv Detail & Related papers (2021-10-11T10:52:17Z) - Active Learning for Massively Parallel Translation of Constrained Text
into Low Resource Languages [26.822210580244885]
We translate a closed text that is known in advance and available in many languages into a new and severely low resource language.
We compare the portion-based approach that optimize coherence of the text locally with the random sampling approach that increases coverage of the text globally.
We propose an algorithm for human and machine to work together seamlessly to translate a closed text into a severely low resource language.
arXiv Detail & Related papers (2021-08-16T14:49:50Z) - ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality
Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and an endangered language Cherokee.
It supports both statistical and neural translation models as well as provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z) - Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z) - "Listen, Understand and Translate": Triple Supervision Decouples
End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) takes audio in a source language and outputs the text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z) - An Augmented Translation Technique for low Resource language pair:
Sanskrit to Hindi translation [0.0]
In this work, Zero Shot Translation (ZST) is inspected for a low resource language pair.
The same architecture is tested for Sanskrit to Hindi translation for which data is sparse.
Dimensionality reduction of word embedding is performed to reduce the memory usage for data storage.
arXiv Detail & Related papers (2020-06-09T17:01:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.