Related papers: TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

URL: http://arxiv.org/abs/2510.06217v1
Date: Tue, 07 Oct 2025 17:59:41 GMT
Title: TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Authors: Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He,
Abstract summary: TaTToo is a novel table-grounded PRM framework that integrates tool-based verification to provide precise reward supervision.<n>We train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning to align our model with table-based verification.
Score: 77.01182934427095
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.

Related papers

Step-Tagging: Toward controlling the generation of Language Reasoning Models through step monitoring [5.190961793309368]
A growing body of studies show that Language Reasoning Models (LRMs) are still inefficient, over-generating verification and reflection steps.<n>We introduce the Step-Tagging framework, a lightweight sentence-classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating.<n>Online monitoring of the count of specific steps can produce effective interpretable early stopping criteria of LRM inferences.
arXiv Detail & Related papers (2025-12-16T12:01:16Z)
One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences.<n>We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios.<n>To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
arXiv Detail & Related papers (2025-10-30T06:08:27Z)
CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling [60.55856973678002]
Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning.<n>Existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs.<n>We propose textbfCALM, a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks.
arXiv Detail & Related papers (2025-10-05T13:38:31Z)
TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning [10.267950603662776]
TableMind is a tool-integrated table reasoning agent that autonomously performs multi-turn tool invocation, writes and executes code in a secure sandbox environment for data analysis and precise numerical reasoning.<n>To realize these capabilities, we adopt a two-stage fine-tuning paradigm built on top of a powerful pre-trained language model.
arXiv Detail & Related papers (2025-09-08T02:00:31Z)
KAT-V1: Kwai-AutoThink Technical Report [50.84483585850113]
We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks.<n>KAT dynamically switches between reasoning and non-reasoning modes based on task complexity.<n>We also propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework.
arXiv Detail & Related papers (2025-07-11T04:07:10Z)
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [75.72672339168092]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM to evaluate trajectory-response type of reasoning traces.<n>ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data.<n>Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z)
SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models [2.7645012830234]
Large reasoning models excel at complex reasoning tasks but typically generate lengthy sequential chains-of-thought.<n>SPRINT is a novel post-training and inference-time framework designed to enable LRMs to dynamically identify and exploit opportunities for parallelization.<n>We show that the models fine-tuned with the SPRINT framework match the performance of reasoning models on complex domains such as mathematics.
arXiv Detail & Related papers (2025-06-06T05:10:31Z)
Table-R1: Inference-Time Scaling for Table Reasoning [56.812846737424245]
We develop and evaluate two post-training strategies to enable inference-time scaling.<n>For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1.<n>For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model.
arXiv Detail & Related papers (2025-05-29T16:28:50Z)
Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards [25.810871864483076]
External Process Reward Models (PRMs) can be introduced during training to provide fine-grained supervision.<n>We propose Reward-BIRD, a framework that explores how to incorporate PRMs into the Text-to-the- reasoning process effectively.
arXiv Detail & Related papers (2025-05-07T08:32:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.