Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
- URL: http://arxiv.org/abs/2509.06861v1
- Date: Mon, 08 Sep 2025 16:28:25 GMT
- Title: Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
- Authors: James Xu Zhao, Bryan Hooi, See-Kiong Ng
- Abstract summary: Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains. We show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations.
- Score: 93.00109641811788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that compared to non-thinking, enabling thinking remains beneficial. Code and data are available at https://github.com/XuZhao0/tts-knowledge
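The abstract frames the evaluation around three mutually exclusive outcomes per question: a correct answer, a hallucination (an incorrect attempt), and an abstention, compared across thinking budgets. Below is a minimal sketch of that bookkeeping in Python, assuming responses have already been graded against gold answers; the `Response` schema and the `outcome`, `rates`, and `compare_budgets` names are illustrative assumptions, not the code released at the GitHub link above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Response:
    """One graded model response (hypothetical schema, not the paper's released format)."""
    question_id: str
    answer: str | None   # None means the model abstained (e.g. "I don't know")
    is_correct: bool     # graded against the gold answer; ignored when answer is None


def outcome(r: Response) -> str:
    """Map a graded response to one of three mutually exclusive outcomes."""
    if r.answer is None:
        return "abstain"
    return "correct" if r.is_correct else "hallucination"


def rates(responses: list[Response]) -> dict[str, float]:
    """Fraction of questions answered correctly, hallucinated, or abstained."""
    counts = Counter(outcome(r) for r in responses)
    total = len(responses)
    return {k: counts.get(k, 0) / total for k in ("correct", "hallucination", "abstain")}


def compare_budgets(low: list[Response], high: list[Response]) -> None:
    """Compare the same questions under a low vs. high thinking budget and show how
    outcomes flip (e.g. hallucination -> abstain vs. hallucination -> correct)."""
    low_outcome = {r.question_id: outcome(r) for r in low}
    flips = Counter((low_outcome[r.question_id], outcome(r)) for r in high)
    print("low budget :", rates(low))
    print("high budget:", rates(high))
    for (before, after), count in sorted(flips.items()):
        if before != after:
            print(f"  {before} -> {after}: {count}")
```

Under this decomposition, a drop in hallucination rate driven mostly by hallucination-to-abstain flips points to more cautious behavior rather than improved factual recall, which is the pattern the abstract reports for several models.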
Related papers
- Why Language Models Hallucinate [29.666976858078073]
Large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty.
arXiv Detail & Related papers (2025-09-04T21:26:31Z) - Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models [130.5487886246353]
Extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance (a minimal sketch of this kind of trace extension appears after this list). This raises a natural question: does thinking more at test time truly lead to better reasoning? We show a consistent pattern of initial performance improvements from additional thinking followed by a decline due to "overthinking".
arXiv Detail & Related papers (2025-06-04T17:55:09Z) - More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models [43.465268635499754]
Test-time compute has empowered large language models to generate extended reasoning chains. As generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors.
arXiv Detail & Related papers (2025-05-23T05:08:40Z) - Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models [8.97308732968526]
We study the causality of hallucinations under constrained knowledge domains by auditing the Chain-of-Thought trajectory. Our analysis reveals that in long-CoT settings, RLLMs can iteratively reinforce biases and errors through flawed reflective reasoning. Surprisingly, even direct interventions at the origin of hallucinations often fail to reverse their effects.
arXiv Detail & Related papers (2025-05-19T14:11:09Z) - Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective [11.013059864022667]
Reasoning hallucinations are logically coherent but factually incorrect reasoning traces. These errors are embedded within structured reasoning, making them more difficult to detect and potentially more harmful. We propose the Reasoning Score, which quantifies the depth of reasoning by measuring the divergence between logits. We also introduce GRPO-R, an enhanced reinforcement learning algorithm that incorporates step-level deep reasoning rewards via potential-based shaping.
arXiv Detail & Related papers (2025-05-19T09:16:40Z) - The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination [85.18584652829799]
We introduce a novel framework to quantify factual hallucinations by modeling knowledge overshadowing. We propose a new decoding strategy, CoDa, to mitigate hallucinations, which notably enhances model factuality on Overshadow (27.9%), MemoTrap (13.1%), and NQ-Swap (18.3%).
arXiv Detail & Related papers (2025-02-22T08:36:06Z) - On Large Language Models' Hallucination with Regard to Known Facts [74.96789694959894]
Large language models are successful in answering factoid questions but are also prone to hallucination.
We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics.
Our study sheds light on the reasons for LLMs' hallucinations on their known facts and, more importantly, on accurately predicting when they are hallucinating.
arXiv Detail & Related papers (2024-03-29T06:48:30Z) - Unfamiliar Finetuning Examples Control How Language Models Hallucinate [75.03210107477157]
Large language models are known to hallucinate when faced with unfamiliar queries.
We find that unfamiliar examples in the models' finetuning data are crucial in shaping these errors.
Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations.
arXiv Detail & Related papers (2024-03-08T18:28:13Z) - Quantity Matters: Towards Assessing and Mitigating Number Hallucination in Large Vision-Language Models [57.42800112251644]
We focus on a specific type of hallucination: number hallucination, referring to models incorrectly identifying the number of certain objects in pictures.
We devise a training approach aimed at improving consistency to reduce number hallucinations, which leads to an 8% enhancement in performance over direct finetuning methods.
arXiv Detail & Related papers (2024-03-03T02:31:11Z)