An approach for mistranslation removal from popular dataset for Indic MT Task
- URL: http://arxiv.org/abs/2401.06398v1
- Date: Fri, 12 Jan 2024 06:37:19 GMT
- Title: An approach for mistranslation removal from popular dataset for Indic MT Task
- Authors: Sudhansu Bala Das, Leo Raphael Rodrigues, Tapas Kumar Mishra, Bidyut Kr. Patra
- Abstract summary: We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
- Score: 5.4755933832880865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The conversion of content from one language to another utilizing a computer
system is known as Machine Translation (MT). Various techniques have emerged to
ensure effective translations that retain the contextual and lexical
interpretation of the source language. End-to-end Neural Machine Translation
(NMT) is a popular technique and it is now widely used in real-world MT
systems. Massive amounts of parallel datasets (sentences in one language
alongside translations in another) are required for MT systems. These datasets
are crucial for an MT system to learn linguistic structures and patterns of
both languages during the training phase. One such dataset is Samanantar, the
largest publicly accessible parallel dataset for Indian languages (ILs). Since
the corpus has been gathered from various sources, it contains many incorrect
translations. Hence, MT systems built using this dataset cannot reach their
full potential. In this paper, we propose an algorithm to remove
mistranslations from the training corpus and evaluate its performance and
efficiency. Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are
chosen for the experiment. A baseline NMT system is built for these two ILs,
and the effect of different dataset sizes is also investigated. The quality of
the translations in the experiment is evaluated using standard metrics such as
BLEU, METEOR, and RIBES. The results show that removing the incorrect
translations from the dataset improves translation quality. In addition,
although the ILs-English and English-ILs systems are trained on the same
corpus, ILs-English performs better across all the evaluation metrics.
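To make the evaluation step concrete, the following is a minimal sketch of corpus-level BLEU, one of the three metrics named in the abstract. It is not the authors' evaluation code: the page does not say which BLEU implementation or tokenizer was used, so the whitespace tokenization, the single reference per sentence, the absence of smoothing, and the names `ngrams` and `corpus_bleu` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing.

    Assumption: sentences are tokenized by whitespace (hypothetical choice).
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches makes BLEU zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty lowers the score when hypotheses are shorter
    # than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(system_outputs, reference_translations)` on two equal-length lists of sentences returns a value between 0 and 1. Published scores normally come from an established toolkit with standardized tokenization, so numbers from this sketch should not be compared directly with those reported in the paper.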
Related papers
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
(2024-07-03T17:04:17Z)
- Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
(2023-05-30T17:03:52Z)
- Statistical Machine Translation for Indic Languages [1.8899300124593648]
This paper examines the development of bilingual Statistical Machine Translation models.
To create the system, the MOSES open-source SMT toolkit is explored.
In our experiment, the quality of the translation is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
(2023-01-02T06:23:12Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
(2022-12-20T15:02:38Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
(2022-12-20T14:39:58Z)
- Improving Multilingual Neural Machine Translation System for Indic Languages [0.0]
We propose a multilingual neural machine translation (MNMT) system to address the issues related to low-resource language translation.
A state-of-the-art transformer architecture is used to realize the proposed model.
Trials over a good amount of data reveal its superiority over the conventional models.
(2022-09-27T09:51:56Z)
- DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages [5.367993194110256]
DivEMT is the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages.
We assess the impact on translation productivity of two state-of-the-art NMT systems, namely: Google Translate and the open-source multilingual model mBART50.
(2022-05-24T17:22:52Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
(2021-07-30T17:58:54Z)
- Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation [8.554761233491236]
We analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems.
We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems.
(2020-05-01T10:50:53Z)
- Neural Machine Translation for Low-Resourced Indian Languages [4.726777092009554]
Machine translation is an effective approach to convert text to a different language without any human involvement.
In this paper, we have applied NMT to two of the most morphologically rich Indian languages, i.e. English-Tamil and English-Malayalam.
We proposed a novel NMT model using Multihead self-attention along with pre-trained Byte-Pair-Encoded (BPE) and MultiBPE embeddings to develop an efficient translation system.
(2020-04-19T17:29:34Z)
- Bootstrapping a Crosslingual Semantic Parser [74.99223099702157]
We adapt a semantic parser trained on a single language, such as English, to new languages and multiple domains with minimal annotation.
We ask whether machine translation is an adequate substitute for training data, and extend this to investigate bootstrapping using joint training with English, paraphrasing, and multilingual pre-trained models.
(2020-04-06T12:05:02Z)