Related papers: Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

URL: http://arxiv.org/abs/2406.12822v2
Date: Thu, 11 Jul 2024 16:37:24 GMT
Title: Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?
Authors: Pinzhen Chen, Simon Yu, Zhicheng Guo, Barry Haddow,
Abstract summary: Large language models, particularly multilingual ones, are designed, claimed, and expected to cater to native speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may not perfectly align with this objective owing to a heavy reliance on translation. It remains unknown whether the nature of the instruction data has an impact on the model output; conversely, it is questionable whether translated test sets can capture such nuances.
Score: 17.011882550422452
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models, particularly multilingual ones, are designed, claimed, and expected to cater to native speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may not perfectly align with this objective owing to a heavy reliance on translation, which can introduce translation artefacts and defects. It remains unknown whether the nature of the instruction data has an impact on the model output; conversely, it is questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues using controlled native or translated data during instruction tuning and evaluation stages. Experiments on eight base models and eight different benchmarks show that native or generation benchmarks reveal a notable difference between native and translated instruction data especially when model performance is high, whereas other types of test sets cannot. The comparison between round-trip and single-pass translations reflects the importance of knowledge from language-native resources. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.

Related papers

Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings [1.1556013985948772]
We evaluate transferability of pre-trained language models to low-resource Indonesian local languages.<n>We group the target languages into three categories: seen, partially seen, and unseen.<n> Multilingual models perform best on seen languages, moderately on partially seen ones, and poorly on unseen languages.<n>We find that MAD-X significantly improves performance, especially for seen and partially seen languages, without requiring labeled data in the target language.
arXiv Detail & Related papers (2025-07-02T12:17:55Z)
Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training [58.696660064190475]
We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching.
arXiv Detail & Related papers (2025-04-02T15:09:58Z)
X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions [43.90353059292894]
Large language models respond well in high-resource languages like English but struggle in low-resource languages. We propose a novel method to construct cross-lingual instruction following samples with instruction in English and response in low-resource languages.
arXiv Detail & Related papers (2024-05-30T06:45:23Z)
Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning [26.49647954587193]
In this work, we find that translation inconsistencies do exist and they disproportionally impact low-resource languages in XNLI. To identify such inconsistencies, we propose measuring the gap in performance between zero-shot evaluations on the human-translated and machine-translated target text. We also corroborate that translation errors exist for two target languages, namely Hindi and Urdu, by doing a manual reannotation of human-translated test instances.
arXiv Detail & Related papers (2024-02-03T08:22:51Z)
Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings. Our model operates on parallel data in $N$ languages. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
Cross-Lingual Fine-Grained Entity Typing [26.973783464706447]
We present a unified cross-lingual fine-grained entity typing model capable of handling over 100 languages. We analyze this model's ability to generalize to languages and entities unseen during training.
arXiv Detail & Related papers (2021-10-15T03:22:30Z)
Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages. We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
proper handling of discourse significantly contributes to the quality of machine translation (MT) Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation. We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict. This work shows a comparison of a neural model and character language models with varying amounts on target language data. Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation [82.96358326053115]
We investigate sensitivity of probing task results to structural design choices. We probe embeddings in a multilingual setup with design choices that lie in a'stable region', as we identify for English. We find that results on English do not transfer to other languages.
arXiv Detail & Related papers (2020-06-16T12:37:50Z)
On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages. We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.