On the Calibration of Massively Multilingual Language Models
- URL: http://arxiv.org/abs/2210.12265v1
- Date: Fri, 21 Oct 2022 21:41:56 GMT
- Title: On the Calibration of Massively Multilingual Language Models
- Authors: Kabir Ahuja, Sunayana Sitaram, Sandipan Dandapat, Monojit Choudhury
- Abstract summary: Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.
We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages.
We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially.
- Score: 15.373725507698591
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Massively Multilingual Language Models (MMLMs) have recently gained
popularity due to their surprising effectiveness in cross-lingual transfer.
While there has been much work in evaluating these models for their performance
on a variety of tasks and languages, little attention has been paid to how well
calibrated these models are with respect to the confidence in their
predictions. We first investigate the calibration of MMLMs in the zero-shot
setting and observe a clear case of miscalibration in low-resource languages or
those which are typologically diverse from English. Next, we empirically show
that calibration methods like temperature scaling and label smoothing do
reasonably well towards improving calibration in the zero-shot scenario. We
also find that few-shot examples in the language can further help reduce the
calibration errors, often substantially. Overall, our work contributes towards
building more reliable multilingual models by highlighting the issue of their
miscalibration, understanding what language and model specific factors
influence it, and pointing out strategies to improve it.
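The abstract names temperature scaling and label smoothing as calibration remedies and measures miscalibration via calibration error. The sketch below is a minimal, generic illustration of these ideas in NumPy, not the paper's implementation: it shows how dividing logits by a temperature T > 1 softens overconfident predictions, and how Expected Calibration Error (ECE) compares binned confidence against accuracy. All function names here are illustrative.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature scaling: divide logits by T before normalizing.
    # T > 1 flattens the distribution (lower confidence); T < 1 sharpens it.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_calibration_error(confidences, correct, n_bins=10):
    # ECE: weighted average gap between mean confidence and accuracy,
    # computed over equal-width confidence bins.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece
```

A perfectly calibrated model has ECE 0: among predictions made with confidence 0.8, exactly 80% are correct. Temperature is typically fit on a held-out set by minimizing negative log-likelihood, which changes confidences without changing the argmax prediction.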
Related papers
- Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models [11.421452042888523]
We compare different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques.
Our results offer practical suggestions, for example, calibrating in the target language can efficiently retain the language modeling capability but does not necessarily benefit downstream tasks.
arXiv Detail & Related papers (2024-08-26T16:29:13Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
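The paper above derives confidence from agreement across multiple sampled generations. A minimal sketch of one such consistency measure, assuming answers have already been extracted from the sampled generations (the function name and exact metric are illustrative, not the paper's):

```python
from collections import Counter

def consistency_confidence(sampled_answers):
    # Agreement-based confidence: take the modal answer across samples
    # and use the fraction of samples that agree with it as confidence.
    counts = Counter(sampled_answers)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(sampled_answers)
```

For example, if 3 of 4 sampled generations answer "42", the modal answer "42" gets confidence 0.75. The paper explores several consistency metrics; simple agreement is only one of them.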
arXiv Detail & Related papers (2024-02-21T16:15:20Z) - On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z) - On the Calibration of Multilingual Question Answering LLMs [57.296161186129545]
We benchmark the calibration of several multilingual Large Language Models (MLLMs) on a variety of Question Answering tasks.
We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings.
For decoder-only LLMs such as Llama 2, we additionally find that in-context learning improves confidence calibration on multilingual data.
arXiv Detail & Related papers (2023-11-15T03:29:02Z) - Evaluating the Capability of Large-scale Language Models on Chinese Grammatical Error Correction Task [10.597024796304016]
Large-scale language models (LLMs) have shown remarkable capability in a variety of Natural Language Processing (NLP) tasks.
This report explores how large language models perform on Chinese grammatical error correction tasks.
arXiv Detail & Related papers (2023-07-08T13:10:59Z) - Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models [23.881825575095945]
Large pre-trained language models (PLMs) have demonstrated strong performance on natural language understanding (NLU) tasks through fine-tuning.
However, fine-tuned models still suffer from overconfident predictions, especially in out-of-domain settings.
We demonstrate that the PLMs are well-calibrated on the masked language modeling task with robust predictive confidence under domain shift.
We show that preserving pre-trained features can improve the calibration of fine-tuned language models.
arXiv Detail & Related papers (2023-05-30T17:35:31Z) - A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings.
We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.