Order Matters in the Presence of Dataset Imbalance for Multilingual
Learning
- URL: http://arxiv.org/abs/2312.06134v1
- Date: Mon, 11 Dec 2023 05:46:57 GMT
- Title: Order Matters in the Presence of Dataset Imbalance for Multilingual
Learning
- Authors: Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg,
Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani
- Abstract summary: We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks.
We show its improvements in neural machine translation (NMT) and multi-lingual language modeling.
- Score: 53.74649778447903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we empirically study the optimization dynamics of multi-task
learning, particularly focusing on those that govern a collection of tasks with
significant data imbalance. We present a simple yet effective method of
pre-training on high-resource tasks, followed by fine-tuning on a mixture of
high/low-resource tasks. We provide a thorough empirical study and analysis of
this method's benefits, showing that it achieves consistent improvements
relative to the performance trade-off profile of standard static weighting. We
analyze under what data regimes this method is applicable and show its
improvements empirically in neural machine translation (NMT) and multi-lingual
language modeling.
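To make the recipe above concrete, here is a minimal sketch of the two-stage schedule: sample batches only from the high-resource tasks during pre-training, then switch to a fixed (static) mixture over all tasks for fine-tuning. The get_batch and train_step helpers, the step counts, and the mixture weights are hypothetical placeholders, not the paper's actual configuration.

```python
import random

# Sketch of the two-stage schedule: pre-train on high-resource tasks only,
# then fine-tune on a static mixture of high- and low-resource tasks.
# `get_batch(task)` and `train_step(model, batch)` are hypothetical stand-ins
# for an existing data pipeline and optimizer update.

def pretrain_then_finetune(model, high_resource, low_resource,
                           pretrain_steps, finetune_steps, mixture_weights):
    # Stage 1: sample batches only from the high-resource tasks.
    for _ in range(pretrain_steps):
        task = random.choice(high_resource)
        train_step(model, get_batch(task))

    # Stage 2: sample from all tasks according to fixed mixture weights.
    all_tasks = high_resource + low_resource
    weights = [mixture_weights[task] for task in all_tasks]
    for _ in range(finetune_steps):
        task = random.choices(all_tasks, weights=weights, k=1)[0]
        train_step(model, get_batch(task))

    return model
```

The point of the ordering is that low-resource tasks are withheld during the long first stage and only enter the mixture once the model has already been trained on the high-resource tasks.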
Related papers
- Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate [105.86576388991713]
We introduce a normalized gradient difference (NGDiff) algorithm, which enables better control over the trade-off between the objectives.
We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets.
arXiv Detail & Related papers (2024-10-29T14:41:44Z)
- Mitigating Training Imbalance in LLM Fine-Tuning via Selective Parameter Merging [11.223074654129915]
Supervised fine-tuning (SFT) is crucial for adapting Large Language Models (LLMs) to specific tasks.
We propose to mitigate this imbalance by merging SFT models fine-tuned with different data orders (a brief weight-averaging sketch follows this entry).
arXiv Detail & Related papers (2024-10-01T08:44:31Z)
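The entry above merges SFT models that were fine-tuned with different data orders. As an illustration of the basic operation such merging builds on (not the paper's selective scheme), a plain parameter average of several checkpoints could look like this; the checkpoint paths are hypothetical.

```python
import torch

# Illustrative sketch only: average the parameters of several SFT checkpoints
# fine-tuned with different data orders. The cited paper's *selective* merging
# is more involved; this shows the element-wise weight averaging it builds on.

def average_checkpoints(state_dicts):
    merged = {}
    for name in state_dicts[0]:
        stacked = torch.stack([sd[name].float() for sd in state_dicts], dim=0)
        merged[name] = stacked.mean(dim=0)
    return merged

# Usage with hypothetical checkpoint files:
# state_dicts = [torch.load(p, map_location="cpu")
#                for p in ("sft_order1.pt", "sft_order2.pt", "sft_order3.pt")]
# model.load_state_dict(average_checkpoints(state_dicts))
```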
- One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
- Multi-Task Learning with Summary Statistics [4.871473117968554]
We propose a flexible multi-task learning framework utilizing summary statistics from various sources.
We also present an adaptive parameter selection approach based on a variant of Lepski's method.
This work offers a more flexible tool for training related models across various domains, with practical implications in genetic risk prediction.
arXiv Detail & Related papers (2023-07-05T15:55:23Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Learning Neural Models for Natural Language Processing in the Face of Distributional Shift [10.990447273771592]
The dominating NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications.
It builds upon the assumption that the data distribution is stationary, i.e., that the data is sampled from a fixed distribution at both training and test time.
This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information.
It is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime.
arXiv Detail & Related papers (2021-09-03T14:29:20Z)
- Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve the Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z)
- Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods for BERT, a pre-trained Transformer-based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters (a short layer-freezing sketch follows this entry).
arXiv Detail & Related papers (2020-12-04T08:34:39Z)
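The layer-freezing idea from the entry above can be sketched as follows. The attribute names assume the Hugging Face transformers BertForSequenceClassification layout, and the number of frozen layers is an arbitrary example value.

```python
from transformers import AutoModelForSequenceClassification

def freeze_lower_layers(model, num_frozen_layers):
    # Freeze the embeddings and the lowest encoder blocks so that only the
    # upper blocks and the classification head remain trainable.
    for param in model.bert.embeddings.parameters():
        param.requires_grad = False
    for layer in model.bert.encoder.layer[:num_frozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False
    return model

# Example usage (hypothetical two-label task):
# model = AutoModelForSequenceClassification.from_pretrained(
#     "bert-base-uncased", num_labels=2)
# model = freeze_lower_layers(model, num_frozen_layers=8)
```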
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings (a linearization sketch follows this entry).
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
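The core trick in the DAGA entry above is linearizing labeled sentences so that an ordinary language model can be trained on them and later sampled to produce synthetic tagging data. The linearization below, which inserts each non-O tag directly before its word, is an illustration in that spirit rather than the paper's exact format.

```python
def linearize(tokens, tags):
    # Turn a (token, tag) sequence into a single string a language model can
    # be trained on: each non-O tag is emitted right before its word.
    out = []
    for token, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(token)
    return " ".join(out)

print(linearize(["John", "lives", "in", "Paris"],
                ["B-PER", "O", "O", "B-LOC"]))
# -> "B-PER John lives in B-LOC Paris"
```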
- Reinforced Curriculum Learning on Pre-trained Neural Machine Translation Models [20.976165305749777]
We learn a curriculum for improving a pre-trained NMT model by re-selecting influential data samples from the original training set.
We propose a data selection framework based on Deterministic Actor-Critic, in which a critic network predicts the expected change of model performance.
arXiv Detail & Related papers (2020-04-13T03:40:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.