Related papers: On the Calibration of Multilingual Question Answering LLMs

On the Calibration of Multilingual Question Answering LLMs

URL: http://arxiv.org/abs/2311.08669v2
Date: Mon, 15 Apr 2024 14:44:04 GMT
Title: On the Calibration of Multilingual Question Answering LLMs
Authors: Yahan Yang, Soham Dan, Dan Roth, Insup Lee,
Abstract summary: We benchmark the calibration of several multilingual Large Language Models (MLLMs) on a variety of Question Answering tasks. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings. For decoder-only LLMs such as LlaMa2, we additionally find that in-context learning improves confidence calibration on multilingual data.
Score: 57.296161186129545
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multilingual pre-trained Large Language Models (LLMs) are incredibly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracies on several multilingual benchmarks. However, little is known about how well their confidences are calibrated. In this paper, we comprehensively benchmark the calibration of several multilingual LLMs (MLLMs) on a variety of QA tasks. We perform extensive experiments, spanning encoder-only, encoder-decoder, and decoder-only QA models (size varying from 110M to 7B parameters) and diverse languages, including both high- and low-resource ones. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings, and investigate strategies to improve it, including post-hoc methods and regularized fine-tuning. For decoder-only LLMs such as LlaMa2, we additionally find that in-context learning improves confidence calibration on multilingual data. We also conduct several ablation experiments to study the effect of language distances, language corpus size, and model size on calibration, and how multilingual models compare with their monolingual counterparts for diverse tasks and languages. Our experiments suggest that the multilingual QA models are poorly calibrated for languages other than English and incorporating a small set of cheaply translated multilingual samples during fine-tuning/calibration effectively enhances the calibration performance.

Related papers

M-Prometheus: A Suite of Open Multilingual LLM Judges [64.22940792713713]
We introduce M-Prometheus, a suite of open-weight LLM judges that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs.
arXiv Detail & Related papers (2025-04-07T11:37:26Z)
Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models [0.0]
The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively. We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts. Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts.
arXiv Detail & Related papers (2024-10-16T19:34:34Z)
Scaling Laws for Multilingual Language Models [41.6318470003173]
A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio. We derive a power-law relationship that links performance with dataset size, model size and sampling ratios.
arXiv Detail & Related papers (2024-10-15T20:29:38Z)
Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models [11.421452042888523]
We compare different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques. Our results offer practical suggestions, for example, calibrating in the target language can efficiently retain the language modeling capability but does not necessarily benefit downstream tasks.
arXiv Detail & Related papers (2024-08-26T16:29:13Z)
The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance. Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes. We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
Multilingual Brain Surgeon: Large Language Models Can be Compressed Leaving No Language Behind [14.433894552549337]
Large Language Models (LLMs) have ushered in a new era in Natural Language Processing, but their massive size demands effective compression techniques for practicality. This paper introduces Multilingual Brain Surgeon (MBS), a novel calibration data sampling method for multilingual LLMs compression. MBS overcomes the English-centric limitations of existing methods by sampling calibration data from various languages proportionally to the language distribution of the model training datasets.
arXiv Detail & Related papers (2024-04-06T22:16:32Z)
Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages. In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
On the Calibration of Massively Multilingual Language Models [15.373725507698591]
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages. We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially.
arXiv Detail & Related papers (2022-10-21T21:41:56Z)
High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference. Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder. HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
Breaking Down Multilingual Machine Translation [74.24795388967907]
We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs) Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z)
Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting. Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.