An Evaluation of Persian-English Machine Translation Datasets with
Transformers
- URL: http://arxiv.org/abs/2302.00321v1
- Date: Wed, 1 Feb 2023 08:55:08 GMT
- Title: An Evaluation of Persian-English Machine Translation Datasets with
Transformers
- Authors: Amir Sartipi, Meghdad Dehghan, Afsaneh Fatemi
- Abstract summary: This study collected and analysed the most popular and valuable parallel corpora, which were used for Persian-English translation.
We fine-tuned and evaluated two state-of-the-art attention-based seq2seq models on each dataset separately.
- Score: 1.0742675209112622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, many researchers are focusing their attention on the subject of
machine translation (MT). However, Persian machine translation has remained
unexplored despite a vast amount of research being conducted in languages with
high resources, such as English. Moreover, while a substantial amount of
research has been undertaken in statistical machine translation for some
datasets in Persian, there is currently no standard baseline for
transformer-based text2text models on each corpus. This study collected and
analysed the most popular and valuable parallel corpora, which were used for
Persian-English translation. Furthermore, we fine-tuned and evaluated two
state-of-the-art attention-based seq2seq models on each dataset separately (48
results). We hope this paper will assist researchers in comparing their Persian
to English and vice versa machine translation results to a standard baseline.
Related papers
- ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations [0.0]
ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic song lyrics, novels, and TV show subtitles that are manually translated and aligned with their English counterparts. The dataset contains 25,557 segment pairs that can be used to benchmark new machine translation models, fine-tune large language models in few-shot settings, and adapt commercial machine translation applications such as Google Translate.
(2025-08-02T15:28:41Z)
- Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis [0.0]
The study aims to provide insights into the effectiveness of different machine translation approaches in handling both general and specialized language domains. Results indicate varying performance levels across different metrics, highlighting strengths and areas for improvement in current translation systems.
(2025-05-26T07:15:06Z)
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
(2024-07-03T17:04:17Z)
- An approach for mistranslation removal from popular dataset for Indic MT
Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely, Hindi (HIN) and Odia (ODI) are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
(2024-01-12T06:37:19Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text
Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
(2023-06-08T07:33:22Z)
- Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on the multilingual models rather than the Machine Translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed.
(2023-05-23T16:56:10Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
(2022-10-01T05:02:04Z)
- X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents [12.493662336994106]
We present an abstractive cross-lingual summarization dataset for four different languages in the scholarly domain.
We train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese.
(2022-05-30T12:31:28Z)
- A Large-Scale Study of Machine Translation in the Turkic Languages [7.3458368273762815]
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems.
However, there is still a large number of languages that are yet to reap the benefits of NMT.
This paper provides the first large-scale case study of the practical application of MT in the Turkic language family.
(2021-09-09T23:56:30Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality
Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
(2021-07-30T17:58:54Z)
- Bootstrapping a Crosslingual Semantic Parser [74.99223099702157]
We adapt a semantic parser trained on a single language, such as English, to new languages and multiple domains with minimal annotation.
We query if machine translation is an adequate substitute for training data, and extend this to investigate bootstrapping using joint training with English, paraphrasing, and multilingual pre-trained models.
(2020-04-06T12:05:02Z)
- Machine Translation Pre-training for Data-to-Text Generation -- A Case
Study in Czech [5.609443065827995]
We study the effectiveness of machine translation based pre-training for data-to-text generation in non-English languages.
We find that pre-training lets us train end-to-end models with significantly improved performance.
(2020-04-05T02:47:16Z)