DeMuX: Data-efficient Multilingual Learning
- URL: http://arxiv.org/abs/2311.06379v1
- Date: Fri, 10 Nov 2023 20:09:08 GMT
- Title: DeMuX: Data-efficient Multilingual Learning
- Authors: Simran Khanuja, Srinivas Gowriraj, Lucio Dery, Graham Neubig
- Abstract summary: DeMuX is a framework that prescribes exact data-points to label from vast amounts of unlabelled multilingual data.
Our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations.
- Score: 57.37123046817781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the task of optimally fine-tuning pre-trained multilingual
models, given small amounts of unlabelled target data and an annotation budget.
In this paper, we introduce DeMuX, a framework that prescribes the exact
data-points to label from vast amounts of unlabelled multilingual data, having
unknown degrees of overlap with the target set. Unlike most prior works, our
end-to-end framework is language-agnostic, accounts for model representations,
and supports multilingual target configurations. Our active learning strategies
rely upon distance and uncertainty measures to select task-specific neighbors
that are most informative to label, given a model. DeMuX outperforms strong
baselines in 84% of the test cases, in the zero-shot setting of disjoint source
and target language sets (including multilingual target pools), across three
models and four tasks. Notably, in low-budget settings (5-100 examples), we
observe gains of up to 8-11 F1 points for token-level tasks, and 2-5 F1 for
complex tasks. Our code is released here:
https://github.com/simran-khanuja/demux.
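The selection strategy described in the abstract (distance and uncertainty measures over model representations) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the Euclidean nearest-neighbour distance, and the simple `entropy - distance` scoring are all assumptions made for the sketch.

```python
import numpy as np

def select_points_to_label(unlabelled_emb, target_emb, probs, budget):
    """Rank unlabelled points by closeness to the target set and by model
    uncertainty, then pick the top `budget` points to annotate."""
    # Distance from each unlabelled point to its nearest target point.
    dists = np.linalg.norm(
        unlabelled_emb[:, None, :] - target_emb[None, :, :], axis=-1
    ).min(axis=1)
    # Predictive entropy of the model's class probabilities as uncertainty.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    # Prefer points that are both close to the target set AND uncertain.
    scores = entropy - dists
    return np.argsort(-scores)[:budget]
```

In low-budget settings (5-100 examples, as in the paper), ranking the entire unlabelled pool this way and labelling only the top few points is where such a strategy would matter most.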
Related papers
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z) - XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations [25.50509874992198]
Cross-Lingual Semantic Parsing aims to translate queries in multiple natural languages into meaning representations.
Existing CLSP models are separately proposed and evaluated on datasets of limited tasks and applications.
We present XSemPLR, a unified benchmark for cross-lingual semantic parsing featured with 22 natural languages and 8 meaning representations.
arXiv Detail & Related papers (2023-06-07T01:09:37Z) - Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Nearest Neighbour Few-Shot Learning for Cross-lingual Classification [2.578242050187029]
We propose cross-lingual adaptation using a simple nearest-neighbour few-shot (15 samples) inference technique for classification tasks.
Our approach consistently improves traditional fine-tuning using only a handful of labeled samples in target locales.
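As a rough sketch of what such nearest-neighbour inference can look like (assuming sentence embeddings are already computed; the function name and the use of Euclidean distance are illustrative assumptions, not the paper's exact method):

```python
import numpy as np

def nn_classify(query_emb, support_emb, support_labels):
    """Label each query with the class of its nearest labelled support example."""
    # Pairwise Euclidean distances, shape (n_query, n_support).
    d = np.linalg.norm(query_emb[:, None, :] - support_emb[None, :, :], axis=-1)
    return support_labels[d.argmin(axis=1)]
```

With only a handful of labelled target-locale samples as the support set, a lookup like this replaces a full fine-tuning step at inference time.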
arXiv Detail & Related papers (2021-09-06T03:18:23Z) - MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark [31.91964553419665]
We present a new multilingual dataset, called MTOP, comprising 100k annotated utterances in 6 languages across 11 domains.
We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets, over best results reported in their experiments.
We demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.
arXiv Detail & Related papers (2020-08-21T07:02:11Z) - CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
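A toy version of dictionary-based code-switching augmentation might look like the following (the function name, the uniform replacement ratio, and the word-level dictionary lookup are simplifying assumptions; the actual CoSDA-ML procedure is more involved):

```python
import random

def code_switch(sentence, dictionaries, ratio=0.3, seed=0):
    """Replace a random fraction of tokens with their translation in a
    randomly chosen target language, keeping tokens with no dictionary entry."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        if rng.random() < ratio:
            lang = rng.choice(sorted(dictionaries))  # pick a target language
            out.append(dictionaries[lang].get(tok.lower(), tok))
        else:
            out.append(tok)
    return " ".join(out)
```

Because the augmentation only needs word-level bilingual dictionaries, it avoids the parallel-sentence requirement the summary above highlights.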
arXiv Detail & Related papers (2020-06-11T13:15:59Z) - XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z) - Zero-Shot Cross-Lingual Transfer with Meta Learning [45.29398184889296]
We consider the setting of training models on multiple languages at the same time, when little or no data is available for languages other than English.
We show that this challenging setup can be approached using meta-learning.
We experiment using standard supervised, zero-shot cross-lingual, as well as few-shot cross-lingual settings for different natural language understanding tasks.
arXiv Detail & Related papers (2020-03-05T16:07:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.