ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models
- URL: http://arxiv.org/abs/2510.14077v1
- Date: Wed, 15 Oct 2025 20:33:08 GMT
- Title: ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models
- Authors: Haziq Mohammad Khalid, Athikash Jeyaganthan, Timothy Do, Yicheng Fu, Sean O'Brien, Vasu Sharma, Kevin Zhu,
- Abstract summary: Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally.<n>We introduce ERGO, which quantifies internal uncertainty via Shannon entropy over next token distributions.<n> ERGO embraces variability in language and modeling, representing and responding to uncertainty.
- Score: 9.509429747913519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally. Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real world usability. We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context. We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected. By treating uncertainty as a first class signal rather than a nuisance to eliminate, ERGO embraces variability in language and modeling, representing and responding to uncertainty. In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude (peak performance capability) by 24.7%, and decreases unreliability (variability in performance) by 35.3%, demonstrating that uncertainty aware interventions can improve both accuracy and reliability in conversational AI.
Related papers
- Reinforcement Inference: Leveraging Uncertainty for Self-Correcting Language Model Reasoning [0.0]
Reinforcement Inference uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt.<n>On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72% to 84.03%.
arXiv Detail & Related papers (2026-02-09T11:08:24Z) - Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures.<n>We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy.<n>Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7$times$ compared to strong prompting and uncertainty-based baselines.
arXiv Detail & Related papers (2025-11-11T21:50:44Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage.<n>We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps.<n>SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
Diffusion Large Language Models offer efficient parallel generation and capable global modeling.<n>The dominant application ofDLLMs is hindered by the need for a statically predefined generation length.<n>We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
arXiv Detail & Related papers (2025-08-01T17:56:07Z) - Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation [0.3495246564946556]
Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning.<n>Do these models truly reason, or do they merely exploit shallow statistical patterns?<n>We introduce Chain-of-Code Collapse, where we investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations.
arXiv Detail & Related papers (2025-06-08T02:43:46Z) - Evaluating the Sensitivity of LLMs to Prior Context [2.377922603550519]
Large language models (LLMs) are increasingly deployed in multi-turn dialogue and other sustained interactive scenarios.<n>We introduce a novel set of benchmarks that vary the volume and nature of prior context to measure sensitivity to contextual variations.<n>Our findings reveal that LLM performance on multiple-choice questions can degrade dramatically in multi-turn interactions.
arXiv Detail & Related papers (2025-05-29T16:09:32Z) - SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization [57.69385990442078]
Large language models (LLMs) exhibit varying levels of confidence across input prompts (questions)<n>Semantic entropy measures the diversity of meaning in multiple generated answers given a prompt and uses this to modulate the magnitude of policy updates.
arXiv Detail & Related papers (2025-05-18T10:20:59Z) - Out of Style: RAG's Fragility to Linguistic Variation [29.59506089890902]
User queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components.<n>We analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impact RAG performance.
arXiv Detail & Related papers (2025-04-11T03:30:26Z) - Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness [9.208007322096535]
We present Latent Adversarial Paraphrasing (LAP), a dual-loop adversarial framework.<n>LAP trains a learnable perturbation to serve as a "latent continuous paraphrase"<n>We conduct experiments to demonstrate the effectiveness of LAP across multiple LLM architectures.
arXiv Detail & Related papers (2025-03-03T09:36:50Z) - Auto-Prompt Generation is Not Robust: Prompt Optimization Driven by Pseudo Gradient [50.15090865963094]
We introduce PertBench, a comprehensive benchmark dataset that includes a wide range of input perturbations.<n>Our analysis reveals substantial vulnerabilities in existing prompt generation strategies.<n>We propose PGO, a gradient-free prompt generation framework that leverages perturbation types as pseudo-gradient signals.
arXiv Detail & Related papers (2024-12-24T06:05:08Z) - UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation [93.38604803625294]
We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented Generation (RAG)
We use Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks.
UncertaintyRAG outperforms baselines by 2.03% on LLaMA-2-7B, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-10-03T17:39:38Z) - On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.