PhoMT: A High-Quality and Large-Scale Benchmark Dataset for
Vietnamese-English Machine Translation
- URL: http://arxiv.org/abs/2110.12199v1
- Date: Sat, 23 Oct 2021 11:42:01 GMT
- Title: PhoMT: A High-Quality and Large-Scale Benchmark Dataset for
Vietnamese-English Machine Translation
- Authors: Long Doan, Linh The Nguyen, Nguyen Luong Tran, Thai Hoang, Dat Quoc
Nguyen
- Abstract summary: We introduce a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs.
This is 2.9M pairs larger than the benchmark Vietnamese-English machine translation corpus IWSLT15.
In both automatic and human evaluations, the best performance is obtained by fine-tuning the pre-trained sequence-to-sequence denoising auto-encoder mBART.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a high-quality and large-scale Vietnamese-English parallel
dataset of 3.02M sentence pairs, which is 2.9M pairs larger than the benchmark
Vietnamese-English machine translation corpus IWSLT15. We conduct experiments
comparing strong neural baselines and well-known automatic translation engines
on our dataset and find that in both automatic and human evaluations: the best
performance is obtained by fine-tuning the pre-trained sequence-to-sequence
denoising auto-encoder mBART. To the best of our knowledge, this is the first
large-scale Vietnamese-English machine translation study. We hope our publicly
available dataset and study can serve as a starting point for future research
and applications on Vietnamese-English machine translation.
Related papers
- Improving Vietnamese-English Medical Machine Translation [14.172448099399407]
MedEV is a high-quality Vietnamese-English parallel dataset constructed specifically for the medical domain, comprising approximately 360K sentence pairs.
We conduct extensive experiments comparing Google Translate, ChatGPT (gpt-3.5-turbo), state-of-the-art Vietnamese-English neural machine translation models and pre-trained bilingual/multilingual sequence-to-sequence models on our new MedEV dataset.
Experimental results show that the best performance is achieved by fine-tuning "vinai-translate" for each translation direction.
arXiv Detail & Related papers (2024-03-28T06:07:15Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
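The mining step described above can be sketched as a greedy similarity search over sentence embeddings. The function below is a minimal illustration, not the paper's pipeline: it assumes precomputed embedding vectors (plain lists of floats) and a hypothetical similarity threshold; real systems typically use more robust margin-based scoring over multilingual embeddings such as LASER.

```python
import math

def mine_parallel_pairs(src_vecs, tgt_vecs, threshold=0.7):
    """Greedy bitext-mining sketch: pair each source sentence with its
    most similar target sentence by cosine similarity, keeping only
    pairs that clear the threshold."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    pairs = []
    for i, s in enumerate(src_vecs):
        # Best-matching target index for this source sentence.
        best_sim, best_j = max((cos(s, t), j) for j, t in enumerate(tgt_vecs))
        if best_sim >= threshold:
            pairs.append((i, best_j, best_sim))
    return pairs
```

In practice the mined pairs would then go through the noise-cleaning step the paper describes before being split into training, development, and test sets.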
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Better Datastore, Better Translation: Generating Datastores from Pre-Trained Models for Nearest Neural Machine Translation [48.58899349349702]
Nearest Neighbor Machine Translation (kNN-MT) is a simple and effective way to augment neural machine translation (NMT) with a token-level nearest-neighbor retrieval mechanism.
In this paper, we propose PRED, a framework that leverages Pre-trained models for Datastores in kNN-MT.
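The retrieval-interpolation idea behind kNN-MT can be sketched in a few lines. The code below is a toy illustration under stated assumptions (a datastore of context-vector/target-token pairs, squared-L2 distance, a softmax over negative distances, linear interpolation weight `lam`), not the PRED implementation:

```python
import math
from collections import Counter

def knn_mt_distribution(nmt_probs, datastore, query, k=2, temperature=1.0, lam=0.5):
    """Interpolate the NMT model's next-token distribution with one built
    from the k nearest datastore entries (context vector -> target token)."""
    # Squared-L2 distance from the query context to every stored context.
    scored = sorted(
        (sum((q - c) ** 2 for q, c in zip(query, ctx)), tok)
        for ctx, tok in datastore
    )[:k]
    # Softmax over negative distances yields the retrieval distribution.
    weights = [math.exp(-d / temperature) for d, _ in scored]
    z = sum(weights)
    knn_probs = Counter()
    for (_, tok), w in zip(scored, weights):
        knn_probs[tok] += w / z
    # Linear interpolation of the retrieval and model distributions.
    vocab = set(nmt_probs) | set(knn_probs)
    return {t: lam * knn_probs[t] + (1 - lam) * nmt_probs.get(t, 0.0) for t in vocab}
```

PRED's contribution, per the summary above, is in how the datastore entries themselves are generated from pre-trained models; the interpolation step stays the same.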
arXiv Detail & Related papers (2022-12-17T08:34:20Z)
- BJTU-WeChat's Systems for the WMT22 Chat Translation Task [66.81525961469494]
This paper introduces the joint submission of the Beijing Jiaotong University and WeChat AI to the WMT'22 chat translation task for English-German.
Building on the Transformer, we apply several effective variants.
Our systems achieve COMET scores of 0.810 and 0.946.
arXiv Detail & Related papers (2022-11-28T02:35:04Z)
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
- MTet: Multi-domain Translation for English and Vietnamese [10.126442202316825]
MTet is the largest publicly available parallel corpus for English-Vietnamese translation.
We release EnViT5, the first pretrained model for English and Vietnamese.
arXiv Detail & Related papers (2022-10-11T16:55:21Z)
- A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation [17.35935715147861]
This paper introduces a high-quality and large-scale benchmark dataset for English-Vietnamese speech translation with 508 audio hours.
To the best of our knowledge, this is the first large-scale English-Vietnamese speech translation study.
arXiv Detail & Related papers (2022-08-08T16:11:26Z)
- Quality-Aware Decoding for Neural Machine Translation [64.24934199944875]
We propose quality-aware decoding for neural machine translation (NMT).
We leverage recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods.
We find that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics and to human assessments.
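One common instance of quality-aware decoding is Minimum Bayes Risk (MBR) reranking over an N-best list: instead of returning the highest-probability (MAP) hypothesis, pick the candidate that scores best on average against the others under some utility. The sketch below uses a toy unigram-F1 utility in place of a learned metric such as COMET; the function names are illustrative, not from the paper:

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Toy utility: unigram F1 overlap between two token lists."""
    h, r = Counter(hyp), Counter(ref)
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def mbr_decode(candidates, utility=unigram_f1):
    """Return the candidate with the highest total utility against all
    other candidates (MBR decoding with uniform candidate weights)."""
    def expected_utility(hyp):
        return sum(utility(hyp, other) for other in candidates if other is not hyp)
    return max(candidates, key=expected_utility)
```

Swapping in a reference-based or reference-free learned metric as the utility recovers the kind of quality-aware inference methods the paper compares.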
arXiv Detail & Related papers (2022-05-02T15:26:28Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Sentence Extraction-Based Machine Reading Comprehension for Vietnamese [0.2446672595462589]
We introduce UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in Vietnamese.
The dataset comprises 23,074 question-answer pairs based on 5,109 passages from 174 Vietnamese Wikipedia articles.
Our experiments show that the best-performing model is XLM-R-Large, which achieves an exact match (EM) score of 85.97% and an F1-score of 88.77% on our dataset.
arXiv Detail & Related papers (2021-05-19T10:22:27Z)
- scb-mt-en-th-2020: A Large English-Thai Parallel Corpus [3.3072037841206354]
We construct an English-Thai machine translation dataset with over 1 million segment pairs.
We train machine translation models based on this dataset.
The dataset, pre-trained models, and source code to reproduce our work are available for public use.
arXiv Detail & Related papers (2020-07-07T15:14:32Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.