Related papers: ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark

URL: http://arxiv.org/abs/2501.01290v1
Date: Thu, 02 Jan 2025 15:10:52 GMT
Title: ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Authors: Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx,
Abstract summary: We introduce ToolComp, a benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators. We generate synthetic training data to compare the performance of outcome-supervised reward models with process-supervised reward models.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures during inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation across six different model families demonstrates the challenging nature of our dataset, with the majority of models achieving less than 50% accuracy. Additionally, we generate synthetic training data to compare the performance of outcome-supervised reward models (ORMs) with process-supervised reward models (PRMs) to assess their ability to improve complex tool-use reasoning as evaluated by ToolComp. Our results show that PRMs generalize significantly better than ORMs, achieving a 19% and 11% improvement in rank@1 accuracy for ranking base and fine-tuned model trajectories, respectively. These findings highlight the critical role of process supervision in both the evaluation and training of AI models, paving the way for more robust and capable systems in complex, multi-step tool-use tasks.

Related papers

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [27.07998056454784]
ReTool enhances long-form reasoning with tool-integrated learning. Model achieves 67% accuracy with 400 training steps. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings.
arXiv Detail & Related papers (2025-04-15T18:10:22Z)
Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance. We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM). R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset. We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models [31.426274932333264]
We present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model.
arXiv Detail & Related papers (2024-02-16T09:14:49Z)
Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision [40.984680166762345]
We introduce Model-induced Process Supervision (MiPS), a novel method for automating data curation. MiPS annotates an intermediate step by sampling completions of this solution through the reasoning model, and obtaining an accuracy defined as the proportion of correct completions. Our approach significantly improves the performance of PaLM 2 on math and coding tasks.
arXiv Detail & Related papers (2024-02-05T00:57:51Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
Let's Verify Step by Step [73.58107073356732]
We show that process supervision significantly outperforms outcome supervision for training models to solve problems. Our model solves 78% of problems from a representative subset of the MATH test set. We also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
arXiv Detail & Related papers (2023-05-31T17:24:00Z)
Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDM) automatically identify underperforming groups of datapoints. This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks. Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.