Calibrated Language Models and How to Find Them with Label Smoothing
- URL: http://arxiv.org/abs/2508.00264v1
- Date: Fri, 01 Aug 2025 02:12:20 GMT
- Title: Calibrated Language Models and How to Find Them with Label Smoothing
- Authors: Jerry Huang, Peng Lu, Qiuhao Zeng,
- Abstract summary: Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, how this impacts confidence calibration for reliable model output has not been fully researched.
- Score: 4.038344945496787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, understanding how this impacts confidence calibration for reliable model output has not been researched in full. In this work, we examine various open-sourced LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown as an effective method to regularize for overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight as to why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large vocabulary LLMs (LV-LLMs). We posit the cause to stem from the ability to become over-confident, which has a direct relationship with the hidden size and vocabulary size, and justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label smoothed loss setting, designing a customized kernel to dramatically reduce memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.
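The label-smoothed cross-entropy the abstract refers to is a standard regularizer: the one-hot target distribution is mixed with a uniform distribution over the vocabulary, penalizing over-confident predictions. A minimal pure-Python sketch for a single token follows; this illustrates only the textbook loss, not the paper's memory-efficient fused kernel, and the function name and signature are chosen here for illustration:

```python
import math

def label_smoothed_ce(logits, target, eps=0.1):
    """Cross-entropy for one token with label smoothing.

    The target distribution is q_k = (1 - eps) * [k == target] + eps / K,
    so the loss decomposes into a standard NLL term plus a uniform term.
    """
    K = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    nll = -log_probs[target]            # standard cross-entropy term
    uniform = -sum(log_probs) / K       # uniform-target term
    return (1.0 - eps) * nll + eps * uniform
```

With `eps = 0` this reduces to ordinary cross-entropy; for confidently peaked logits, any `eps > 0` raises the loss, which is the mechanism that discourages over-confidence during SFT.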
Related papers
- Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission [87.68447072141402]
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. We propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL).
arXiv Detail & Related papers (2025-06-30T02:56:11Z) - SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models [3.962074007736394]
We introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model. We demonstrate that our method significantly outperforms existing pruning methods. Our method achieves very competitive performance among 1B-scale open source LLMs.
arXiv Detail & Related papers (2025-06-10T02:24:32Z) - An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits [0.0]
Post-training quantization reduces memory and computation but often degrades accuracy. Pushing quantization to the ternary (2-bit) regime yields even larger savings but is notoriously unstable. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks.
arXiv Detail & Related papers (2025-05-12T21:14:29Z) - O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models [17.56932003351322]
Test-time prompt tuning for vision-language models (VLMs) is getting attention because of its ability to learn with unlabeled data without fine-tuning. The resulting models tend to demonstrate poor calibration, which casts doubts on the reliability and trustworthiness of these models. We propose a new approach, called O-TPT, that introduces orthogonality constraints on the textual features corresponding to the learnable prompts.
arXiv Detail & Related papers (2025-03-15T11:45:54Z) - Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG).
We compute a factuality score that can be thresholded to yield a binary decision.
Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
arXiv Detail & Related papers (2024-11-01T20:44:59Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. In this paper, we propose Subspace Zero-order optimization to address the challenges posed by high-dimensional perturbations.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Geometric-Averaged Preference Optimization for Soft Preference Labels [78.2746007085333]
Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic. In this work, we introduce distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function.
arXiv Detail & Related papers (2024-09-10T17:54:28Z) - Get Confused Cautiously: Textual Sequence Memorization Erasure with Selective Entropy Maximization [17.20276556057748]
Large Language Models (LLMs) have been found to memorize and recite some of the textual sequences from their training set verbatim.
This Textual Sequence Memorization (TSM) phenomenon leads to a high demand to regulate LLM output to prevent it from generating certain memorized text.
Existing methods for TSM erasure fail to forget massive memorized samples without substantially jeopardizing the model utility.
arXiv Detail & Related papers (2024-08-09T10:26:11Z) - Smooth Pseudo-Labeling [4.1569253650826195]
A fruitful method in Semi-Supervised Learning (SSL) is Pseudo-Labeling (PL).
PL suffers from the important drawback that the associated loss function has discontinuities in its derivatives, which cause instabilities in performance when labels are very scarce.
We introduce a new benchmark, where labeled images are selected randomly from the whole dataset, without imposing representation of each class proportional to its frequency in the dataset.
arXiv Detail & Related papers (2024-05-23T08:33:07Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z) - Mitigating Biases of Large Language Models in Stance Detection with Counterfactual Augmented Calibration [43.02857908228108]
Large language models (LLMs) have demonstrated significant advancements across various natural language processing tasks, including stance detection. Their performance in stance detection is limited by biases and spurious correlations inherent to their data-driven nature. We propose FACTUAL, a Counterfactual Augmented Network in which a novel calibration network is devised to calibrate potential bias in the stance prediction of LLMs.
arXiv Detail & Related papers (2024-02-22T05:17:49Z) - Semi-Supervised Class-Agnostic Motion Prediction with Pseudo Label Regeneration and BEVMix [59.55173022987071]
We study the potential of semi-supervised learning for class-agnostic motion prediction.
Our framework adopts a consistency-based self-training paradigm, enabling the model to learn from unlabeled data.
Our method exhibits comparable performance to weakly and some fully supervised methods.
arXiv Detail & Related papers (2023-12-13T09:32:50Z) - One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models [42.95555008229016]
We propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need of any retraining.
The advantages of the proposed method exhibit even more when the sparsity is extremely high.
arXiv Detail & Related papers (2023-10-14T05:43:09Z) - Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z) - SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.