MTet: Multi-domain Translation for English and Vietnamese
- URL: http://arxiv.org/abs/2210.05610v1
- Date: Tue, 11 Oct 2022 16:55:21 GMT
- Title: MTet: Multi-domain Translation for English and Vietnamese
- Authors: Chinh Ngo, Trieu H. Trinh, Long Phan, Hieu Tran, Tai Dang, Hieu
Nguyen, Minh Nguyen and Minh-Thang Luong
- Abstract summary: MTet is the largest publicly available parallel corpus for English-Vietnamese translation.
We release the first pretrained model EnViT5 for English and Vietnamese languages.
- Score: 10.126442202316825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce MTet, the largest publicly available parallel corpus for
English-Vietnamese translation. MTet consists of 4.2M high-quality training
sentence pairs and a multi-domain test set refined by the Vietnamese research
community. Combined with previous work on English-Vietnamese translation, this
grows the existing parallel dataset to 6.2M sentence pairs. We also release the
first pretrained model EnViT5 for English and Vietnamese languages. Combining
both resources, our model significantly outperforms previous state-of-the-art
results by up to 2 points in translation BLEU score, while being 1.6 times
smaller.
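As a quick illustration of how a released checkpoint like EnViT5 could be used for translation, here is a minimal sketch with Hugging Face transformers. The checkpoint id "VietAI/envit5-translation" and the "en: "/"vi: " language-prefix convention are assumptions; check the official release for the exact names and input format.

```python
# Minimal sketch of English->Vietnamese translation with EnViT5.
# Assumes the checkpoint id "VietAI/envit5-translation" and an
# "en: "/"vi: " language prefix on each input line (both assumptions;
# see the official release for the actual conventions).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "VietAI/envit5-translation"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer(
    ["en: MTet is a multi-domain parallel corpus for English-Vietnamese translation."],
    return_tensors="pt",
    padding=True,
)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```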
Related papers
- Improving Vietnamese-English Medical Machine Translation [14.172448099399407]
MedEV is a high-quality Vietnamese-English parallel dataset constructed specifically for the medical domain, comprising approximately 360K sentence pairs.
We conduct extensive experiments comparing Google Translate, ChatGPT (gpt-3.5-turbo), state-of-the-art Vietnamese-English neural machine translation models and pre-trained bilingual/multilingual sequence-to-sequence models on our new MedEV dataset.
Experimental results show that the best performance is achieved by fine-tuning "vinai-translate" for each translation direction.
arXiv Detail & Related papers (2024-03-28T06:07:15Z)
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task.
We participate in the general translation task on English↔Livonian.
Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
arXiv Detail & Related papers (2022-10-17T04:34:09Z)
- Enriching Biomedical Knowledge for Low-resource Language Through Translation [1.6347851388527643]
We make use of a state-of-the-art English-Vietnamese translation model to translate and produce both pretraining and supervised data in the biomedical domain.
Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus.
arXiv Detail & Related papers (2022-10-11T16:35:10Z)
- PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation [6.950742601378329]
We introduce a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs.
This is 2.9M pairs larger than the benchmark Vietnamese-English machine translation corpus IWSLT15.
In both automatic and human evaluations, the best performance is obtained by fine-tuning the pre-trained sequence-to-sequence denoising auto-encoder mBART.
arXiv Detail & Related papers (2021-10-23T11:42:01Z)
- Zero-shot Cross-lingual Transfer of Neural Machine Translation with Multilingual Pretrained Encoders [74.89326277221072]
How to improve the cross-lingual transfer of an NMT model with a multilingual pretrained encoder remains under-explored.
We propose SixT, a simple yet effective model for this task.
Our model achieves better performance on many-to-English test sets than CRISS and m2m-100.
arXiv Detail & Related papers (2021-04-18T07:42:45Z)
- mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs [51.67970832510462]
We improve the multilingual text-to-text transfer Transformer with translation pairs (mT6).
We explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption (sketched below).
Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
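As a rough, hypothetical sketch (not the authors' code) of what translation span corruption looks like: concatenate a translation pair into one token sequence, replace random spans with T5-style sentinel tokens, and train the model to reconstruct the masked spans.

```python
# Toy sketch of translation span corruption: mask random spans of a
# concatenated bilingual sequence with T5-style sentinels; the training
# target is the sequence of masked spans. Not the authors' implementation.
import random

def span_corrupt(tokens, num_spans=2, span_len=2, seed=0):
    """Replace `num_spans` non-overlapping spans of `span_len` tokens."""
    rng = random.Random(seed)
    while True:  # resample until the chosen spans do not overlap
        starts = sorted(rng.sample(range(len(tokens) - span_len + 1), num_spans))
        if all(b - a >= span_len for a, b in zip(starts, starts[1:])):
            break
    inp, tgt, last = [], [], 0
    for i, s in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[last:s] + [sentinel]   # corrupted input keeps the sentinel
        tgt += [sentinel] + tokens[s:s + span_len]  # target holds the masked span
        last = s + span_len
    inp += tokens[last:]
    tgt.append(f"<extra_id_{len(starts)}>")  # closing sentinel, T5-style
    return inp, tgt

en = "We improve the multilingual Transformer".split()
vi = "Chúng tôi cải thiện Transformer đa ngữ".split()
inp, tgt = span_corrupt(en + vi)  # corrupt the concatenated translation pair
print("input :", " ".join(inp))
print("target:", " ".join(tgt))
```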
arXiv Detail & Related papers (2021-04-18T03:24:07Z)
- A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences [21.43163704217968]
We propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets.
Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data.
arXiv Detail & Related papers (2020-10-17T06:12:25Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese [11.782566169354725]
We present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese.
We find that automatic Vietnamese word segmentation improves the parsing results of both baselines.
The Vietnamese pre-trained language model PhoBERT produces higher performance than the recent best multilingual language model XLM-R.
arXiv Detail & Related papers (2020-10-05T09:54:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.