Related papers: The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

URL: http://arxiv.org/abs/2505.15134v1
Date: Wed, 21 May 2025 05:39:11 GMT
Title: The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
Authors: Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng
Abstract summary: Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' performance on challenging math, physics, and coding tasks.
Score: 44.988290766092184
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
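The quantity all three approaches in the abstract act on is the entropy of the model's next-token distribution. The following is a minimal sketch, not the authors' implementation: it assumes a single hypothetical logit vector and uses plain gradient descent on the logits to lower entropy, loosely in the spirit of the inference-time variant (EM-INF). The function names, the step size `lr`, and the step count `steps` are illustrative assumptions, not values from the paper.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reduce_entropy(logits, lr=0.5, steps=10):
    """Take a few gradient-descent steps on the logits to lower entropy.

    Uses the closed-form gradient dH/dz_j = -p_j * (log p_j + H).
    No model parameters are involved; only the logit vector changes.
    """
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        h = entropy(p)
        grad = [-(pj * (math.log(pj) + h)) if pj > 0.0 else 0.0 for pj in p]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1]
adjusted = reduce_entropy(logits)
before = entropy(softmax(logits))
after = entropy(softmax(adjusted))
print(f"entropy before: {before:.4f}, after: {after:.4f}")
```

In this sketch the most likely token receives the largest upward logit update at every step, so the distribution sharpens without changing which token is ranked first. A training-time variant would instead use the same entropy as a loss (EM-FT) or its negative as a reward (EM-RL) and update model parameters.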

Related papers

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models [15.218318229687242]
Extreme activation outliers in Large Language Models critically degrade quantization performance. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents their formation. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies.
arXiv Detail & Related papers (2025-06-24T15:03:57Z)
Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining [53.893792844055106]
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Selective Efficient Language Modeling, a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines.
arXiv Detail & Related papers (2025-05-26T12:23:26Z)
EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1 to scale reinforcement learning by encouraging exploration and understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
Preserving Diversity in Supervised Fine-Tuning of Large Language Models [29.02934952075354]
This paper introduces a new game-theoretic formulation for Supervised Fine-Tuning (SFT). In this framework, an auxiliary variable is introduced to regulate the learning process. We prove that the proposed game-theoretic approach connects to the problem of reverse KL minimization with entropy regularization.
arXiv Detail & Related papers (2024-08-29T16:21:00Z)
Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters. In practice, however, we only find solutions via our training procedure, including the gradient and regularizers, limiting flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
TernaryLLM: Ternarized Large Language Model [29.29122031050894]
Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks. We introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable. We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization.
arXiv Detail & Related papers (2024-06-11T11:40:12Z)
PHUDGE: Phi-3 as Scalable Judge [1.7495213911983414]
We present a Phi-3 model that achieved SOTA results on four tasks: Feedback Test, Feedback OOD, MT Human, and Preference Test. It correlates strongly not only with GPT-4 but also with human annotators on unseen data, in both absolute and relative grading tasks. We show that by following systematic ML experimentation, thoughtful data augmentation, and reposing the problem itself, we can beat models 10x larger even with less training data.
arXiv Detail & Related papers (2024-05-12T18:22:16Z)
Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches. We present UPET, a novel Uncertainty-aware self-Training framework. We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models. We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.