Related papers: Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

URL: http://arxiv.org/abs/2503.04797v2
Date: Tue, 22 Apr 2025 05:10:55 GMT
Title: Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review
Authors: Rahul Raja, Arpita Vats,
Abstract summary: This review provides a comprehensive overview of available parallel corpora for Indic languages.<n>We critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, and data scarcity.<n>We outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality.
Score: 2.377892000761193
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content.We also discuss and evaluate these corpora in various terms such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.

Related papers

Analyzing and Improving Cross-lingual Knowledge Transfer for Machine Translation [5.878901309908815]
We study cross-lingual knowledge transfer in neural models and develop methods to improve robustness and generalization in multilingual settings.<n>We examine the role of language diversity during training and show that increasing translation coverage improves generalization and reduces off-target behavior.
arXiv Detail & Related papers (2026-01-07T15:51:54Z)
Contextual Cues in Machine Translation: Investigating the Potential of Multi-Source Input Strategies in LLMs and NMT Systems [2.512491726995032]
We compare GPT-4o, a large language model, with a traditional multilingual neural machine translation (NMT) system. Using intermediate language translations as contextual cues, we evaluate their effectiveness in enhancing English and Chinese translations into Portuguese. Results suggest that contextual information significantly improves translation quality for domain-specific datasets and potentially for linguistically distant language pairs.
arXiv Detail & Related papers (2025-03-10T11:23:44Z)
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation [28.456351723077088]
BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark.<n>This dataset is handcrafted in 8 non-English languages.
arXiv Detail & Related papers (2025-02-06T18:56:37Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.<n>P-MMEval delivers consistent language coverage across various datasets and provides parallel samples.<n>We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
Cross-lingual Text Classification Transfer: The Case of Ukrainian [11.508759658889382]
Ukrainian stands as a language that can benefit from the continued refinement of cross-lingual methodologies.<n>Due to our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks.<n>In this work, we leverage the state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods.
arXiv Detail & Related papers (2024-04-02T15:37:09Z)
Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages. We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z)
Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al. We investigate the similarities and differences between the discourse structures of source and target languages. We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding. COD enables dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning [48.15259834021655]
We present a pragmatic approach towards building a multilingual machine translation model that covers hundreds of languages. We use a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs. We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting.
arXiv Detail & Related papers (2022-01-09T23:36:44Z)
Modeling Bilingual Conversational Characteristics for Neural Chat Translation [24.94474722693084]
We aim to promote the translation quality of conversational text by modeling the above properties. We evaluate our approach on the benchmark dataset BConTrasT (English-German) and a self-collected bilingual dialogue corpus, named BMELD (English-Chinese) Our approach notably boosts the performance over strong baselines by a large margin and significantly surpasses some state-of-the-art context-aware NMT models in terms of BLEU and TER.
arXiv Detail & Related papers (2021-07-23T12:23:34Z)
Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source. We observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.