A High-Quality and Large-Scale Dataset for English-Vietnamese Speech
Translation
- URL: http://arxiv.org/abs/2208.04243v1
- Date: Mon, 8 Aug 2022 16:11:26 GMT
- Title: A High-Quality and Large-Scale Dataset for English-Vietnamese Speech
Translation
- Authors: Linh The Nguyen, Nguyen Luong Tran, Long Doan, Manh Luong, Dat Quoc
Nguyen
- Abstract summary: This paper introduces a high-quality and large-scale benchmark dataset for English-Vietnamese speech translation with 508 audio hours.
To the best of our knowledge, this is the first large-scale English-Vietnamese speech translation study.
- Score: 17.35935715147861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a high-quality and large-scale benchmark dataset
for English-Vietnamese speech translation with 508 audio hours, consisting of
331K triplets of (sentence-length audio, English source transcript sentence,
Vietnamese target subtitle sentence). We also conduct empirical experiments
using strong baselines and find that the traditional "Cascaded" approach still
outperforms the modern "End-to-End" approach. To the best of our knowledge,
this is the first large-scale English-Vietnamese speech translation study. We
hope both our publicly available dataset and study can serve as a starting
point for future research and applications on English-Vietnamese speech
translation. Our dataset is available at https://github.com/VinAIResearch/PhoST
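The triplet structure and the "Cascaded" versus "End-to-End" distinction mentioned in the abstract can be illustrated with a minimal sketch. Everything named here (the `Triplet` fields, the `cascaded` helper, the toy stand-in models, the file name) is hypothetical and chosen only for illustration; none of it is taken from the PhoST repository or the authors' code. An end-to-end system would be a single function of the same `SpeechTranslator` type, with no intermediate English text. The average clip length is a rough estimate derived from the rounded figures in the abstract (508 hours, 331K triplets).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Triplet:
    """One example: audio clip, English transcript, Vietnamese subtitle.

    Field names are illustrative assumptions, not the dataset's schema.
    """
    audio_path: str
    en_transcript: str
    vi_subtitle: str


# Placeholder signatures for model stages (not real model APIs).
ASR = Callable[[str], str]               # audio path -> English text
MT = Callable[[str], str]                # English text -> Vietnamese text
SpeechTranslator = Callable[[str], str]  # audio path -> Vietnamese text


def cascaded(asr: ASR, mt: MT) -> SpeechTranslator:
    """Build a two-stage system: transcribe first, then translate the text."""
    def translate(audio_path: str) -> str:
        return mt(asr(audio_path))
    return translate


def average_clip_seconds(total_hours: float, num_triplets: int) -> float:
    """Average audio duration per triplet, in seconds."""
    return total_hours * 3600 / num_triplets


# Toy stand-ins so the sketch runs without any trained models.
example = Triplet("clip_0001.wav", "Thank you.", "Cảm ơn.")
toy_asr: ASR = lambda path: example.en_transcript
toy_mt: MT = lambda text: {"Thank you.": "Cảm ơn."}.get(text, "")

system = cascaded(toy_asr, toy_mt)
print(system(example.audio_path))
print(round(average_clip_seconds(508, 331_000), 1))  # about 5.5 seconds
```

In the cascaded setup, errors made by the transcription stage propagate to the translation stage, whereas an end-to-end model maps audio to target text directly; the abstract reports that the cascaded approach still performed better in the authors' experiments.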
Related papers
- VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension [1.3942150186842373]
This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension tasks.
The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks.
In contrast, VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube.
arXiv Detail & Related papers (2024-02-05T00:54:40Z)
- LyricSIM: A novel Dataset and Benchmark for Similarity Detection in Spanish Song LyricS [52.77024349608834]
We present a new dataset and benchmark tailored to the task of semantic similarity in song lyrics.
Our dataset, originally consisting of 2775 pairs of Spanish songs, was annotated in a collective annotation experiment by 63 native annotators.
arXiv Detail & Related papers (2023-06-02T07:48:20Z)
- Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation [6.950742601378329]
We introduce a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs.
This is 2.9M pairs larger than the benchmark Vietnamese-English machine translation corpus IWSLT15.
In both automatic and human evaluations, the best performance is obtained by fine-tuning the pre-trained sequence-to-sequence denoising auto-encoder mBART.
arXiv Detail & Related papers (2021-10-23T11:42:01Z)
- A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese [11.782566169354725]
We present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese.
We find that automatic Vietnamese word segmentation improves the parsing results of both baselines.
PhoBERT for Vietnamese helps produce higher performance than the recent best multilingual language model XLM-R.
arXiv Detail & Related papers (2020-10-05T09:54:51Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
- Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension [2.5199066832791535]
We construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts.
We propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text.
Our proposed method achieves an accuracy of 61.81%, which is 5.51% higher than the best baseline model.
arXiv Detail & Related papers (2020-01-16T08:09:51Z)