Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models
- URL: http://arxiv.org/abs/2506.09408v1
- Date: Wed, 11 Jun 2025 05:33:56 GMT
- Title: Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models
- Authors: Jui-Ming Yao, Hao-Yuan Chen, Zi-Xian Tang, Bing-Jia Tan, Sheng-Wei Peng, Bing-Cheng Xie, Shun-Feng Su
- Abstract summary: We introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections.
- Score: 4.078176555898098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated impressive performance on multiple-choice question answering (MCQA) benchmarks, yet they remain highly vulnerable to minor input perturbations. In this paper, we introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Through extensive experiments on CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired with prompt engineering (PE) fixes, significantly restores performance degraded by input noise, yielding up to +39% absolute gains for weaker models like Gemma3 1B. Penalty sweep analyses further reveal that TCD implicitly regularizes overconfident outputs, with different models requiring distinct penalty schedules to maximize resilience. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections and pave the way for more reliable deployment of LLMs in safety-critical or user-facing applications.
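As a rough illustration of the idea sketched in the abstract, the snippet below applies a logit penalty to every token outside a fixed answer set (e.g., the option letters A-D in MCQA) before choosing the answer token. This is a minimal sketch under stated assumptions, not the paper's reference implementation: the penalty-based masking, the `constrain_answer_logits` helper, and the token ids are hypothetical, with the penalty playing the role the abstract attributes to the penalty sweeps.

```python
# Minimal, illustrative sketch of token-constrained decoding for MCQA.
# NOT the authors' reference implementation: the penalty-based masking
# and the token ids below are assumptions made for illustration only.
import numpy as np

def constrain_answer_logits(logits: np.ndarray,
                            allowed_token_ids: list[int],
                            penalty: float = 10.0) -> np.ndarray:
    """Subtract a penalty from every token outside the allowed answer set.

    A large penalty approximates a hard constraint (only the option
    tokens can win); smaller penalties act as a soft regularizer on
    overconfident outputs, the regime a penalty sweep would explore.
    """
    constrained = logits.copy()
    mask = np.ones_like(constrained, dtype=bool)
    mask[allowed_token_ids] = False          # keep allowed answer tokens
    constrained[mask] -= penalty             # penalize everything else
    return constrained

def pick_answer(logits: np.ndarray, option_token_ids: dict[str, int],
                penalty: float = 10.0) -> str:
    """Greedy answer choice under the token constraint."""
    constrained = constrain_answer_logits(
        logits, list(option_token_ids.values()), penalty)
    best_id = int(np.argmax(constrained))
    # Map the winning token id back to its option label; with a small
    # penalty a non-option token can still win (soft constraint).
    label_by_id = {tid: label for label, tid in option_token_ids.items()}
    return label_by_id.get(best_id, "<unconstrained token>")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_logits = rng.normal(size=32_000)    # stand-in for model logits
    # Hypothetical token ids for the option letters "A".."D".
    options = {"A": 101, "B": 102, "C": 103, "D": 104}
    print(pick_answer(fake_logits, options, penalty=10.0))
```

In this sketch the penalty value controls how hard the constraint is; the paper's observation that different models need distinct penalty schedules would correspond to sweeping that single scalar per model.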
Related papers
- ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs [16.266957200961908]
Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs). Errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. We introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method to address this specific vulnerability.
arXiv Detail & Related papers (2025-08-07T11:26:40Z)
- Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following [10.119219532863767]
Lazy reasoning during the thinking stage is the primary factor contributing to poor instruction adherence. We propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking. Our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.
arXiv Detail & Related papers (2025-08-05T07:42:00Z)
- R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [60.37610817226533]
Chain-of-thought (CoT) reasoning encourages step-by-step intermediate reasoning during inference. CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. We present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
- Break-The-Chain: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation [0.3495246564946556]
We introduce a suite of semantically faithful yet adversarially structured prompt perturbations. We observe that while certain modifications severely degrade performance, others surprisingly improve model accuracy by up to 35.3%. These findings expose the fragility and unpredictability of current reasoning systems.
arXiv Detail & Related papers (2025-06-08T02:43:46Z)
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
- ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [75.1101108949743]
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting. LRMs often suffer from verbose outputs caused by redundant content, increasing computational overhead, and degrading user experience. We propose ConCISE, a framework that simplifies reasoning chains by reinforcing the model's confidence during inference.
arXiv Detail & Related papers (2025-05-08T01:40:40Z)
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
- Exploring LLM Reasoning Through Controlled Prompt Variations [0.9217021281095907]
We evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations. Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model's context window significantly degrades performance. Certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting.
arXiv Detail & Related papers (2025-04-02T20:18:50Z)
- Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [60.04718679054704]
Chain-of-Thought prompting elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints. SoT achieves token reductions of up to 78% with minimal accuracy loss across 15 reasoning datasets.
arXiv Detail & Related papers (2025-03-07T06:57:17Z)
- Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [48.15636223774418]
Large language models (LLMs) frequently hallucinate due to misaligned self-awareness. Existing approaches mitigate hallucinations via uncertainty estimation or query rejection. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
- Adversarial Robustness Overestimation and Instability in TRADES [4.063518154926961]
TRADES sometimes yields disproportionately high PGD validation accuracy compared to the AutoAttack testing accuracy in the multiclass classification task.
This discrepancy highlights a significant overestimation of robustness for these instances, potentially linked to gradient masking.
arXiv Detail & Related papers (2024-10-10T07:32:40Z)
- Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
- Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.