Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning
- URL: http://arxiv.org/abs/2601.01362v1
- Date: Sun, 04 Jan 2026 04:29:12 GMT
- Title: Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning
- Authors: Jerry Huang, Peng Lu, Qiuhao Zeng, Yusuke Iwasawa, Yutaka Matsuo, Sarath Chandar, Edison Marrese-Taylor, Irene Li
- Abstract summary: This work examines a critical gap in the calibration of large language models (LLMs) in multilingual settings. Even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource-language SFT datasets. However, improvements in accuracy are marginal or non-existent, highlighting a critical shortcoming of standard SFT in multilingual settings.
- Score: 58.355275813623685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential to maintaining their trustworthiness and reliability, yet despite steady advances in foundation model research, the relationship between large language models (LLMs) and their calibration remains an open area of research. In this work, we examine a critical gap in the calibration of LLMs in multilingual settings, aiming to better understand how data scarcity can lead to different calibration effects and how commonly used techniques apply in these settings. Our analysis on two multilingual benchmarks, covering 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource-language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in miscalibration and highlighting a critical shortcoming of standard SFT in multilingual settings. Furthermore, we observe that label smoothing is a reasonable method to alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations when both training and tuning LLMs in order to improve their reliability and fairness in downstream use.
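The two quantities at play in the abstract are a model's confidence and its accuracy; calibration error measures the gap between them, and label smoothing is the remedy the paper studies. Below is a minimal sketch of both, assuming standard expected calibration error (ECE) with equal-width bins and uniform label smoothing; the bin count, smoothing factor, and function names are illustrative choices, not the paper's implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-weighted average gap between mean confidence and accuracy.

    confidences: max softmax probability per prediction, shape [N].
    correct: 1.0 if the prediction matched the label, else 0.0, shape [N].
    n_bins is an illustrative choice; the paper's binning may differ.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight gap by bin's share of samples
    return ece

def label_smoothed_nll(log_probs, target, epsilon=0.1):
    """Cross-entropy with uniform label smoothing (epsilon is illustrative).

    log_probs: log-softmax outputs, shape [N, num_classes]; target: shape [N].
    Mixes the one-hot target with a uniform distribution over classes,
    which discourages the over-confident peaks SFT tends to produce.
    """
    n, _ = log_probs.shape
    nll = -log_probs[np.arange(n), target]  # standard negative log-likelihood
    uniform = -log_probs.mean(axis=1)       # uniform-target component
    return ((1.0 - epsilon) * nll + epsilon * uniform).mean()
```

On the paper's account, instruction-tuning raises confidences while accuracy stays roughly flat, which a metric like ECE would surface as a growing per-bin gap; the smoothed loss counteracts this by preventing the training targets from being fully one-hot.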
Related papers
- Calibrating Beyond English: Language Diversity for Better Quantized Multilingual LLM [10.689556615369272]
Non-English and multilingual calibration sets significantly improve perplexity compared to English-only baselines. Tailoring calibration sets to the evaluation language yields the largest improvements for individual languages.
arXiv Detail & Related papers (2026-01-26T09:36:03Z)
- Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models [50.34755385896279]
Confidence calibration is crucial for the reliable deployment of Large Language Models (LLMs). We conduct the first large-scale, systematic studies of multilingual calibration across six model families and over 100 languages. We find that non-English languages suffer from systematically worse calibration.
arXiv Detail & Related papers (2025-10-03T16:07:15Z)
- Can Large Language Models Express Uncertainty Like Human? [71.27418419522884]
We release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores. We conduct the first systematic study of linguistic confidence across modern large language models.
arXiv Detail & Related papers (2025-09-29T02:34:30Z)
- The Reliability Paradox: Exploring How Shortcut Learning Undermines Language Model Calibration [5.616884466478886]
Pre-trained language models (PLMs) have enabled significant performance gains in the field of natural language processing. Recent studies have found PLMs to suffer from miscalibration, indicating a lack of accuracy in the confidence estimates provided by these models. This paper investigates whether lower calibration error implies reliable decision rules for a language model.
arXiv Detail & Related papers (2024-12-17T08:04:28Z)
- A Comparative Study of Translation Bias and Accuracy in Multilingual Large Language Models for Cross-Language Claim Verification [1.566834021297545]
This study systematically evaluates translation bias and the effectiveness of Large Language Models for cross-lingual claim verification.
We investigate two distinct translation methods: pre-translation and self-translation.
Our findings reveal that low-resource languages exhibit significantly lower accuracy in direct inference due to underrepresentation.
arXiv Detail & Related papers (2024-10-14T09:02:42Z)
- Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation [62.202893186343935]
We explore what it would take to adapt Large Language Models for low-resource languages.
We show that parallel data is critical during both pre-training and Supervised Fine-Tuning (SFT).
Our experiments with three LLMs across two low-resourced language groups reveal consistent trends, underscoring the generalizability of our findings.
arXiv Detail & Related papers (2024-08-23T00:59:38Z)
- On the Calibration of Multilingual Question Answering LLMs [57.296161186129545]
We benchmark the calibration of several multilingual Large Language Models (MLLMs) on a variety of Question Answering tasks.
We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings.
For decoder-only LLMs such as Llama 2, we additionally find that in-context learning improves confidence calibration on multilingual data.
arXiv Detail & Related papers (2023-11-15T03:29:02Z)
- On the Calibration of Massively Multilingual Language Models [15.373725507698591]
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.
We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages.
We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially.
arXiv Detail & Related papers (2022-10-21T21:41:56Z)
- On the Inference Calibration of Neural Machine Translation [54.48932804996506]
We study the correlation between calibration and translation performance, as well as the linguistic properties of miscalibration.
We propose a new graduated label smoothing method that can improve both inference calibration and translation performance.
arXiv Detail & Related papers (2020-05-03T02:03:56Z)
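For context on the last entry above, graduated label smoothing ties the smoothing strength to the model's own confidence rather than using a single global factor. A rough sketch of that idea follows; the thresholds and epsilon values are illustrative assumptions, not the settings from the 2020 paper.

```python
import numpy as np

def graduated_epsilon(confidence, lo=0.3, hi=0.7,
                      eps_low=0.0, eps_mid=0.1, eps_high=0.3):
    """Per-token smoothing factor chosen from the model's own confidence.

    High-confidence predictions are smoothed harder than low-confidence
    ones; all thresholds and epsilon values here are illustrative
    assumptions.
    """
    confidence = np.asarray(confidence, dtype=float)
    eps = np.full_like(confidence, eps_mid)
    eps[confidence <= lo] = eps_low    # leave already-uncertain tokens alone
    eps[confidence >= hi] = eps_high   # push back on over-confident tokens
    return eps
```

Each token's epsilon would then replace the scalar smoothing factor in a loss like the `label_smoothed_nll` sketch above.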