Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New
Datasets for Bengali-English Machine Translation
- URL: http://arxiv.org/abs/2009.09359v2
- Date: Wed, 7 Oct 2020 05:33:13 GMT
- Title: Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New
Datasets for Bengali-English Machine Translation
- Authors: Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin, Masum Hasan, Madhusudan
Basak, M. Sohel Rahman, Rifat Shahriyar
- Abstract summary: Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in the machine translation literature due to its scarcity of resources.
We build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups.
With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising 2.75 million sentence pairs.
- Score: 6.2418269277908065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite being the seventh most widely spoken language in the world, Bengali
has received much less attention in the machine translation literature due to
its scarcity of resources. Most publicly available parallel corpora for Bengali
are not large enough and are of rather poor quality, mostly because of incorrect
sentence alignments resulting from erroneous sentence segmentation, and also
because of a high volume of noise present in them. In this work, we build a
customized sentence segmenter for Bengali and propose two novel methods for
parallel corpus creation on low-resource setups: aligner ensembling and batch
filtering. With the segmenter and the two methods combined, we compile a
high-quality Bengali-English parallel corpus comprising 2.75 million sentence
pairs, more than 2 million of which were not available before. Training
neural models on this corpus, we achieve an improvement of more than 9 BLEU
points over previous approaches to Bengali-English machine translation. We also
evaluate on a new test set of 1000 pairs made with extensive quality control.
We release the segmenter, parallel corpus, and the evaluation set, thus
elevating Bengali from its low-resource status. To the best of our knowledge,
this is the first-ever large-scale study on Bengali-English machine
translation. We believe our study will pave the way for future research on
Bengali-English machine translation as well as other low-resource languages.
Our data and code are available at https://github.com/csebuetnlp/banglanmt.
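The two proposed methods lend themselves to a short illustration. Below is a minimal Python sketch of aligner ensembling (keep a sentence pair only when several independent aligners agree on it) and batch filtering (score candidate pairs in fixed-size batches and drop low-scoring ones). The `Aligner` interface, the `score` callback, and all thresholds are hypothetical placeholders rather than the authors' implementation; the real one lives in the repository linked above.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical aligner interface: takes (src_sents, tgt_sents) and returns a
# set of (src_index, tgt_index) pairs. In practice each aligner could wrap a
# tool such as Hunalign, Gargantua, or Bleualign.
Aligner = Callable[[list, list], set]

def ensemble_align(aligners: list[Aligner], src: list, tgt: list,
                   min_votes: int = 2) -> set:
    """Keep an alignment only if at least `min_votes` aligners propose it."""
    votes = Counter()
    for align in aligners:
        votes.update(align(src, tgt))
    return {pair for pair, n in votes.items() if n >= min_votes}

def batch_filter(pairs: Iterable, score: Callable, batch_size: int = 1000,
                 threshold: float = 0.5):
    """Score candidate pairs batch by batch, yielding only confident ones.

    Batching keeps memory bounded and lets a (possibly neural) quality
    scorer run efficiently; `score` maps a list of pairs to a list of floats.
    """
    batch = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) == batch_size:
            yield from (p for p, s in zip(batch, score(batch)) if s >= threshold)
            batch = []
    if batch:  # flush the final partial batch
        yield from (p for p, s in zip(batch, score(batch)) if s >= threshold)
```

Requiring agreement from every aligner maximizes precision but discards pairs that only one aligner recovers; `min_votes` makes that trade-off explicit.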
Related papers
- Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs [2.309018557701645]
We explore whether there is a need for Large Language Models dedicated to a low-resource language such as Bengali, as opposed to predominantly English-oriented ones.
We compare the performance of open-weight and closed-source LLMs against fine-tuned encoder-decoder models.
Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent.
arXiv Detail & Related papers (2024-06-29T11:50:16Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
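As a rough illustration of the mining step in the lecture-transcript entry above (not that paper's exact pipeline), candidate sentence pairs are often scored with multilingual sentence embeddings and kept when their similarity clears a threshold. The LaBSE checkpoint, the greedy 1-to-1 matching, and the 0.7 cutoff below are all assumptions for the sketch.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# LaBSE embeds sentences from 100+ languages into one shared space; any
# multilingual encoder (e.g., LASER) could be substituted here.
model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(src_sents, tgt_sents, threshold=0.7):
    """Greedy 1-to-1 bitext mining by cosine similarity (illustrative only)."""
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)
    sims = src_emb @ tgt_emb.T  # cosine similarities, since rows are unit-norm
    pairs = []
    for i in range(sims.shape[0]):
        j = int(np.argmax(sims[i]))
        if sims[i, j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(sims[i, j])))
    return pairs
```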
- Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task.
We participate in the general translation task on English↔Livonian.
Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
arXiv Detail & Related papers (2022-10-17T04:34:09Z)
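For context on the WMT22 entry above: M2M100 is distributed through Hugging Face Transformers, and plain inference with the public checkpoint looks like the snippet below. Livonian is not among M2M100's pre-trained languages, which is exactly why the paper needs adaptation techniques; Estonian, a related language the model does cover, stands in here.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"  # tell the tokenizer the source language
inputs = tokenizer("The sea is calm today.", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force the first decoder token to the target-language code (Estonian).
    forced_bos_token_id=tokenizer.get_lang_id("et"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```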
- Ensembling of Distilled Models from Multi-task Teachers for Constrained Resource Language Pairs [0.0]
We focus on three relatively low-resource language pairs: Bengali to and from Hindi, English to and from Hausa, and Xhosa to and from Zulu.
We train a multilingual model using a multitask objective employing both parallel and monolingual data.
We see around a 70% relative gain in BLEU for English to and from Hausa, and around 25% relative improvements for both Bengali to and from Hindi and Xhosa to and from Zulu.
arXiv Detail & Related papers (2021-11-26T00:54:37Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Simple or Complex? Learning to Predict Readability of Bengali Texts [6.860272388539321]
We present a readability analysis tool capable of analyzing text written in the Bengali language.
Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing.
arXiv Detail & Related papers (2020-12-09T01:41:35Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
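A bare-bones PyTorch sketch of the idea in the entry above: one shared LSTM encoder feeds two decoders, one translating and one reconstructing the input, so the encoder is pushed toward representations useful for both. The sizes and the simple summed loss are assumptions, not the paper's configuration.

```python
import torch.nn as nn

class TranslateReconstruct(nn.Module):
    """Shared LSTM encoder with a translation and a reconstruction decoder."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.trans_dec = nn.LSTM(dim, dim, batch_first=True)
        self.recon_dec = nn.LSTM(dim, dim, batch_first=True)
        self.trans_out = nn.Linear(dim, tgt_vocab)
        self.recon_out = nn.Linear(dim, src_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))   # final (h, c) state
        trans_h, _ = self.trans_dec(self.tgt_emb(tgt_ids), state)
        recon_h, _ = self.recon_dec(self.src_emb(src_ids), state)
        return self.trans_out(trans_h), self.recon_out(recon_h)

# Training would sum two cross-entropy terms, e.g.
# loss = ce(trans_logits, tgt_ids) + ce(recon_logits, src_ids).
```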
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building BWEs in which the vector space of the high resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z)
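The seed-dictionary alignment mentioned in the entry above is classically solved as an orthogonal Procrustes problem; the numpy sketch below shows that standard baseline, not the paper's anchor-based proposal itself.

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y||_F.

    X, Y: (n_pairs, dim) embeddings of the source/target words in a seed
    translation dictionary. Closed form: W = U @ Vt, where U and Vt come
    from the SVD of X^T @ Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Hypothetical usage: map the whole source vocabulary into the target space.
# W = procrustes_align(src_vecs[seed_src_ids], tgt_vecs[seed_tgt_ids])
# src_in_tgt_space = src_vecs @ W
```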
- Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani.
arXiv Detail & Related papers (2020-10-04T11:52:50Z)
- Neural Machine Translation for Low-Resourced Indian Languages [4.726777092009554]
Machine translation is an effective approach for converting text into a different language without human involvement.
In this paper, we apply NMT to two of the most morphologically rich Indian languages, Tamil and Malayalam, on the English-Tamil and English-Malayalam pairs.
We propose a novel NMT model using multi-head self-attention along with pre-trained byte-pair-encoded (BPE) and MultiBPE embeddings to build an efficient translation system; a minimal sketch of the attention block follows this list.
arXiv Detail & Related papers (2020-04-19T17:29:34Z)
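The core building block named in the last entry, multi-head self-attention, ships with PyTorch; below is a small illustrative usage over a batch of (hypothetical) BPE-token embeddings, with all dimensions chosen arbitrarily.

```python
import torch
import torch.nn as nn

# Multi-head self-attention over a batch of BPE-token embeddings.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

x = torch.randn(32, 40, 256)     # (batch, sequence length, embedding dim)
out, weights = attn(x, x, x)     # self-attention: query = key = value = x
print(out.shape)                 # torch.Size([32, 40, 256])
print(weights.shape)             # torch.Size([32, 40, 40]), averaged over heads
```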