scb-mt-en-th-2020: A Large English-Thai Parallel Corpus
- URL: http://arxiv.org/abs/2007.03541v1
- Date: Tue, 7 Jul 2020 15:14:32 GMT
- Title: scb-mt-en-th-2020: A Large English-Thai Parallel Corpus
- Authors: Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford and
Sarana Nutanong
- Abstract summary: We construct an English-Thai machine translation dataset with over 1 million segment pairs.
We train machine translation models based on this dataset.
The dataset, pre-trained models, and source code to reproduce our work are available for public use.
- Score: 3.3072037841206354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The primary objective of our work is to build a large-scale English-Thai
dataset for machine translation. We construct an English-Thai machine
translation dataset with over 1 million segment pairs, curated from various
sources, namely news, Wikipedia articles, SMS messages, task-based dialogs,
web-crawled data and government documents. The methodologies for gathering
data, building parallel texts, and removing noisy sentence pairs are presented
in a reproducible manner. We train machine translation models based on this
dataset. Our models' performance is comparable to that of the Google
Translation API (as of
May 2020) for Thai-English and outperform Google when the Open Parallel Corpus
(OPUS) is included in the training data for both Thai-English and English-Thai
translation. The dataset, pre-trained models, and source code to reproduce our
work are available for public use.
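The abstract mentions removing noisy sentence pairs as part of corpus construction. As a minimal illustrative sketch (not the paper's actual pipeline), a common first-pass heuristic drops empty segments and pairs whose character-length ratio is implausibly skewed, which often indicates misalignment:

```python
def filter_noisy_pairs(pairs, min_chars=1, max_char_ratio=3.0):
    """Drop empty segment pairs and pairs with implausible length ratios.

    A simple heuristic filter, assumed for illustration only; the thresholds
    and the character-based ratio are arbitrary choices, not the paper's.
    """
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if len(src) < min_chars or len(tgt) < min_chars:
            continue  # drop empty or near-empty segments
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_char_ratio:
            continue  # lengths too skewed; likely a misaligned pair
        kept.append((src, tgt))
    return kept

pairs = [
    ("Hello world", "สวัสดีชาวโลก"),  # plausible pair, kept
    ("", "ว่าง"),                      # empty source, dropped
    ("a", "x" * 50),                   # skewed lengths, dropped
]
clean = filter_noisy_pairs(pairs)
```

Real pipelines typically combine several such signals (language identification, alignment scores, deduplication) rather than a single ratio test; character counts are used here because Thai text is not whitespace-segmented.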
Related papers
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- Setting up the Data Printer with Improved English to Ukrainian Machine Translation [0.0]
We introduce a recipe to build a translation system with a noisy parallel dataset of 3M pairs of Ukrainian and English sentences.
Our decoder-only model, named Dragoman, beats the performance of previous state-of-the-art encoder-decoder models on the FLORES devtest set.
arXiv Detail & Related papers (2024-04-23T16:34:34Z)
- A Tulu Resource for Machine Translation [3.038642416291856]
We present the first parallel dataset for English-Tulu translation.
Tulu is spoken by approximately 2.5 million individuals in southwestern India.
Our English-Tulu system, trained without using parallel English-Tulu data, outperforms Google Translate by 19 BLEU points.
arXiv Detail & Related papers (2024-03-28T04:30:07Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with much broader coverage of the world's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.