Mitigating Label Length Bias in Large Language Models
- URL: http://arxiv.org/abs/2511.14385v1
- Date: Tue, 18 Nov 2025 11:45:24 GMT
- Title: Mitigating Label Length Bias in Large Language Models
- Authors: Mario Sanz-Guerrero, Katharina von der Wense,
- Abstract summary: normalized contextual calibration (NCC) is an effective method that normalizes and calibrates predictions at the full-label level.<n>NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1.<n>Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates.
- Score: 16.378998802160375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
Related papers
- Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach [13.829059542429876]
Multiple Choice Question (MCQ) answering is a widely used method for evaluating the performance of Large Language Models (LLMs)<n>LLMs often exhibit selection bias in MCQ tasks, where their choices are influenced by factors like answer position or option symbols rather than the content.
arXiv Detail & Related papers (2025-11-17T21:31:37Z) - Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance [28.524573212179124]
Large language models (LLMs) offer new opportunities to enhance the annotation process.<n>We compare expert, crowd-sourced, and LLM-based annotations in terms of the agreement, label quality, and efficiency.<n>Our findings reveal a substantial number of label errors, which, when corrected, a significant upward shift in reported model performance.
arXiv Detail & Related papers (2024-10-24T16:27:03Z) - Dynamic Correlation Learning and Regularization for Multi-Label Confidence Calibration [60.95748658638956]
This paper introduces the Multi-Label Confidence task, aiming to provide well-calibrated confidence scores in multi-label scenarios.
Existing single-label calibration methods fail to account for category correlations, which are crucial for addressing semantic confusion.
We propose the Dynamic Correlation Learning and Regularization algorithm, which leverages multi-grained semantic correlations to better model semantic confusion.
arXiv Detail & Related papers (2024-07-09T13:26:21Z) - Beyond Performance: Quantifying and Mitigating Label Bias in LLMs [8.77694178599322]
We evaluate different approaches to quantifying label bias in a model's predictions.
Our investigation reveals substantial label bias in models both before and after debiasing attempts.
We propose a novel label bias calibration method tailored for few-shot prompting.
arXiv Detail & Related papers (2024-05-04T19:53:03Z) - Mitigating Label Biases for In-context Learning [28.209613730240633]
Various design settings for in-context learning (ICL) can bias a model toward a particular prediction without being reflective of an understanding of the task.
In this work, we define a typology for three types of label biases in ICL for text classification: vanilla-label bias, context-label bias, and domain-label bias.
arXiv Detail & Related papers (2023-05-28T15:37:39Z) - Class-Distribution-Aware Pseudo Labeling for Semi-Supervised Multi-Label
Learning [97.88458953075205]
Pseudo-labeling has emerged as a popular and effective approach for utilizing unlabeled data.
This paper proposes a novel solution called Class-Aware Pseudo-Labeling (CAP) that performs pseudo-labeling in a class-aware manner.
arXiv Detail & Related papers (2023-05-04T12:52:18Z) - SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised
Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
arXiv Detail & Related papers (2023-01-26T03:53:25Z) - PLM: Partial Label Masking for Imbalanced Multi-label Classification [59.68444804243782]
Neural networks trained on real-world datasets with long-tailed label distributions are biased towards frequent classes and perform poorly on infrequent classes.
We propose a method, Partial Label Masking (PLM), which utilizes this ratio during training.
Our method achieves strong performance when compared to existing methods on both multi-label (MultiMNIST and MSCOCO) and single-label (imbalanced CIFAR-10 and CIFAR-100) image classification datasets.
arXiv Detail & Related papers (2021-05-22T18:07:56Z) - In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label
Selection Framework for Semi-Supervised Learning [53.1047775185362]
Pseudo-labeling (PL) is a general SSL approach that does not have this constraint but performs relatively poorly in its original formulation.
We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models.
We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process.
arXiv Detail & Related papers (2021-01-15T23:29:57Z) - Gradient Descent in RKHS with Importance Labeling [58.79085525115987]
We study importance labeling problem, in which we are given many unlabeled data.
We propose a new importance labeling scheme that can effectively select an informative subset of unlabeled data.
arXiv Detail & Related papers (2020-06-19T01:55:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.