Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning
- URL: http://arxiv.org/abs/2511.04406v1
- Date: Thu, 06 Nov 2025 14:33:29 GMT
- Title: Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning
- Authors: Mohammad Amin Ghanizadeh, Mohammad Javad Dousti
- Abstract summary: This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems. By defining a learnability score, our approach systematically evaluates the utility of data points for training. Experiments on English-to-Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency.
- Score: 2.016758225924076
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy that considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English-to-Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an i.i.d. baseline. Experimental results indicate that our approach improves computational efficiency by 24 when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to the random selection method.
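The abstract does not spell out how the learnability score or the joint batch criterion is computed. The sketch below is one plausible instantiation, not the paper's implementation: it assumes a learner-vs-reference loss gap as the learnability score (in the spirit of reducible-loss selection) and models interdependencies within a batch as a cosine-redundancy penalty over cached sentence embeddings. The Hugging Face-style seq2seq interface and all names (`learnability_scores`, `select_joint_batch`, `redundancy_weight`) are illustrative assumptions.

```python
# Minimal sketch, not the paper's implementation: learnability scoring via a
# learner/reference loss gap, plus greedy, diversity-aware joint batch selection.
# Assumes a Hugging Face-style seq2seq model (e.g. mBART); names are illustrative.
import torch
import torch.nn.functional as F


@torch.no_grad()
def learnability_scores(learner, reference, batch):
    """Score each example: high when the learner still has high loss on a
    sentence pair that the frozen reference model already handles well."""
    def per_example_loss(model):
        logits = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"],
                       labels=batch["labels"]).logits          # (B, T, V)
        token_loss = F.cross_entropy(logits.transpose(1, 2),   # (B, V, T)
                                     batch["labels"],
                                     ignore_index=-100,
                                     reduction="none")          # (B, T)
        mask = (batch["labels"] != -100).float()
        return (token_loss * mask).sum(1) / mask.sum(1).clamp(min=1.0)

    return per_example_loss(learner) - per_example_loss(reference)


def select_joint_batch(scores, embeddings, k, redundancy_weight=0.5):
    """Greedily pick k examples that are individually learnable but not
    redundant with already-selected ones (cosine similarity of embeddings)."""
    emb = F.normalize(embeddings.float(), dim=-1)
    remaining = scores.clone().float()
    chosen = []
    for _ in range(k):
        idx = int(torch.argmax(remaining))
        chosen.append(idx)
        # Down-weight candidates that are similar to the example just selected.
        remaining -= redundancy_weight * (emb @ emb[idx]).clamp(min=0.0)
        remaining[idx] = float("-inf")  # never pick the same example twice
    return chosen
```

If the embeddings are the cached ones mentioned in the abstract, the redundancy penalty adds little overhead beyond the two forward passes used for scoring, which is consistent with the claimed computational savings.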
Related papers
- Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages [1.0535472555708638]
Limited data for low-resource languages typically yield weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. We propose an integrated fine-tuning pipeline that systematically combines active learning (AL), clustering, and dynamic data selection schedulers.
arXiv Detail & Related papers (2025-12-01T09:45:47Z)
- Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection [29.647174797769015]
We introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency.
arXiv Detail & Related papers (2025-07-03T07:19:56Z)
- Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information [2.133855532092057]
We propose an effective data reduction strategy based on Pointwise V-Information (PVI). Experiments show that classifier performance is maintained with only a 0.0001% to 0.76% decline in accuracy when 10%-30% of the data is removed. We have adapted the PVI framework, which was previously limited to English datasets, to a variety of Chinese Natural Language Processing (NLP) tasks and base models.
arXiv Detail & Related papers (2025-06-19T06:59:19Z)
- Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search [59.75749613951193]
We propose Data Influence-oriented Tree Search (DITS) to guide both tree search and data selection. By leveraging influence scores, we effectively identify the most impactful data for system improvement. We derive influence score estimation methods tailored for non-differentiable metrics.
arXiv Detail & Related papers (2025-02-02T23:20:16Z)
- Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective [4.548047308860141]
This study investigates the impact of different types of preference data on model performance.
It aims to reduce their dependency on extensive amounts of preference data, which is expensive to collect.
arXiv Detail & Related papers (2024-10-22T00:11:41Z)
- Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration [39.16321257800402]
We propose a multi-actor collaborative data selection mechanism to accelerate the pretraining of language models (LMs). Each data selection method independently prioritizes data based on its criterion and updates its prioritization rules using the current state of the model. A console is designed to adjust the impacts of different actors at various stages and dynamically integrate information from all actors throughout the LM pretraining process.
arXiv Detail & Related papers (2024-10-10T16:45:28Z)
- One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- On Learning Text Style Transfer with Direct Rewards [101.97136885111037]
Lack of parallel corpora makes it impossible to directly train supervised models for the text style transfer task.
We leverage semantic similarity metrics originally used for fine-tuning neural machine translation models.
Our model provides significant gains in both automatic and human evaluation over strong baselines.
arXiv Detail & Related papers (2020-10-24T04:30:02Z)
- Selecting Informative Contexts Improves Language Model Finetuning [66.26521454263343]
We present a general fine-tuning method that we call information gain filtration.
During fine-tuning, a secondary learner selects informative examples and skips uninformative ones.
We show that our method yields consistent improvements across datasets, fine-tuning tasks, and language model architectures.
arXiv Detail & Related papers (2020-05-01T02:01:18Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set a sentence-specific probability for word selection by considering the words' roles in the sentence.
Our proposed method is evaluated on the WMT14 English-to-German and IWSLT14 German-to-English datasets.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)