Think Right, Not More: Test-Time Scaling for Numerical Claim Verification
- URL: http://arxiv.org/abs/2509.22101v1
- Date: Fri, 26 Sep 2025 09:23:35 GMT
- Title: Think Right, Not More: Test-Time Scaling for Numerical Claim Verification
- Authors: Primakov Chungkham, V Venktesh, Vinay Setty, Avishek Anand,
- Abstract summary: We show that test-time compute can be used to verify complex numerical claims.<n>We introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim.<n>This approach achieves 1.8x higher efficiency than standard TTS, while delivering a notable 18.8% performance improvement over single-shot claim verification methods.
- Score: 14.07771397213171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fact-checking real-world claims, particularly numerical claims, is inherently complex that require multistep reasoning and numerical reasoning for verifying diverse aspects of the claim. Although large language models (LLMs) including reasoning models have made tremendous advances, they still fall short on fact-checking real-world claims that require a combination of compositional and numerical reasoning. They are unable to understand nuance of numerical aspects, and are also susceptible to the reasoning drift issue, where the model is unable to contextualize diverse information resulting in misinterpretation and backtracking of reasoning process. In this work, we systematically explore scaling test-time compute (TTS) for LLMs on the task of fact-checking complex numerical claims, which entails eliciting multiple reasoning paths from an LLM. We train a verifier model (VERIFIERFC) to navigate this space of possible reasoning paths and select one that could lead to the correct verdict. We observe that TTS helps mitigate the reasoning drift issue, leading to significant performance gains for fact-checking numerical claims. To improve compute efficiency in TTS, we introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim. This approach achieves 1.8x higher efficiency than standard TTS, while delivering a notable 18.8% performance improvement over single-shot claim verification methods. Our code and data can be found at https://github.com/VenkteshV/VerifierFC
Related papers
- Chain-of-thought Reviewing and Correction for Time Series Question Answering [22.889720488678076]
We propose T3LLM, which performs multi-step reasoning with an explicit correction mechanism for time series question answering.<n>Within this framework, the worker generates step-wise chains of thought (CoT) under structured prompts, while the reviewer inspects the reasoning, identifies erroneous steps, and provides corrective comments.<n>Experiments on multiple real-world TSQA benchmarks demonstrate that T3LLM achieves state-of-the-art performance over strong LLM-based baselines.
arXiv Detail & Related papers (2025-12-27T15:54:18Z) - DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching [54.98126916293868]
Large Reasoning Models (LRMs) produce excessively long chain-of-thought traces that degrade accuracy.<n>We propose a model-agnostic decoding framework that sketches the reasoning space by branching at high-entropy tokens and applies early stopping to select the shortest completed reasoning path.<n>This approach approximates the optimal solution that enhances both efficiency and accuracy, without requiring additional training or supervision.
arXiv Detail & Related papers (2025-11-01T17:41:28Z) - Test-Time Scaling of Reasoning Models for Machine Translation [16.317481079574065]
Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding.<n>This paper investigates whether increased inference-time computation improves translation quality.
arXiv Detail & Related papers (2025-10-07T21:15:18Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [64.74765550805024]
Chain-of-Thought prompting elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs.<n>We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints.<n>SoT achieves token reductions of up to 84% with minimal accuracy loss across 18 reasoning datasets.
arXiv Detail & Related papers (2025-03-07T06:57:17Z) - Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data.<n>We identify a critical parameter threshold (1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - From Chaos to Clarity: Claim Normalization to Empower Fact-Checking [57.024192702939736]
Claim Normalization (aka ClaimNorm) aims to decompose complex and noisy social media posts into more straightforward and understandable forms.
We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation.
Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures.
arXiv Detail & Related papers (2023-10-22T16:07:06Z) - FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z) - Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.