High-Resource Translation:Turning Abundance into Accessibility
- URL: http://arxiv.org/abs/2504.05914v1
- Date: Tue, 08 Apr 2025 11:09:51 GMT
- Title: High-Resource Translation:Turning Abundance into Accessibility
- Authors: Abhiram Reddy Yanampally,
- Abstract summary: This paper presents a novel approach to constructing an English-to-Telugu translation model by leveraging transfer learning techniques.<n>The model incorporates iterative backtranslation to generate synthetic parallel data, effectively augmenting the training dataset and enhancing the model's translation capabilities.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a novel approach to constructing an English-to-Telugu translation model by leveraging transfer learning techniques and addressing the challenges associated with low-resource languages. Utilizing the Bharat Parallel Corpus Collection (BPCC) as the primary dataset, the model incorporates iterative backtranslation to generate synthetic parallel data, effectively augmenting the training dataset and enhancing the model's translation capabilities. The research focuses on a comprehensive strategy for improving model performance through data augmentation, optimization of training parameters, and the effective use of pre-trained models. These methodologies aim to create a robust translation system that can handle diverse sentence structures and linguistic nuances in both English and Telugu. This work highlights the significance of innovative data handling techniques and the potential of transfer learning in overcoming limitations posed by sparse datasets in low-resource languages. The study contributes to the field of machine translation and seeks to improve communication between English and Telugu speakers in practical contexts.
Related papers
- PAD: Towards Efficient Data Generation for Transfer Learning Using Phrase Alignment [5.0560627648135785]
We show that Phrase Aligned Data (PAD) synergizes effectively with the syntactic characteristics of the Korean language.<n>This innovative approach not only boosts model performance but also suggests a cost-efficient solution for resource-scarce languages.
arXiv Detail & Related papers (2025-03-24T00:29:05Z) - Cross-Lingual Transfer for Low-Resource Natural Language Processing [0.32634122554914]
Cross-lingual transfer learning is a research area aimed at leveraging data and models from high-resource languages to improve NLP performance.<n>This thesis presents a new method to improve data-based transfer with T-Projection, a state-of-the-art annotation projection method.<n>For model-based transfer, we introduce a constrained decoding algorithm that enhances cross-lingual Sequence Labeling in zero-shot settings.<n>Finally, we develop Medical mT5, the first multilingual text-to-text medical model.
arXiv Detail & Related papers (2025-02-04T21:17:46Z) - Unsupervised Data Validation Methods for Efficient Model Training [0.0]
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT) and vision-language models (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data and enhancing accessibility to model training.
arXiv Detail & Related papers (2024-10-10T13:00:53Z) - Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
arXiv Detail & Related papers (2024-10-01T04:20:14Z) - Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a versatile'' model, i.e., the Unified Model Learning for NMT (UMLNMT) that works with data from different tasks.
OurNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z) - Continual Knowledge Distillation for Neural Machine Translation [74.03622486218597]
parallel corpora are not publicly accessible for data copyright, data privacy and competitive differentiation reasons.
We propose a method called continual knowledge distillation to take advantage of existing translation models to improve one model of interest.
arXiv Detail & Related papers (2022-12-18T14:41:13Z) - On the Usability of Transformers-based models for a French
Question-Answering task [2.44288434255221]
This paper focuses on the usability of Transformer-based language models in small-scale learning problems.
We introduce a new compact model for French FrALBERT which proves to be competitive in low-resource settings.
arXiv Detail & Related papers (2022-07-19T09:46:15Z) - Learning to Generalize to More: Continuous Semantic Augmentation for
Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT)
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z) - Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z) - Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP)
In this paper, we explore the landscape of introducing transfer learning techniques for NLP by a unified framework that converts all text-based language problems into a text-to-text format.
arXiv Detail & Related papers (2019-10-23T17:37:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.