TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
- URL: http://arxiv.org/abs/2510.01279v1
- Date: Tue, 30 Sep 2025 19:19:56 GMT
- Title: TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
- Authors: Yongchao Chen, Jiefeng Chen, Rui Meng, Ji Yin, Na Li, Chuchu Fan, Chi Wang, Tomas Pfister, Jinsung Yoon,
- Abstract summary: We propose an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths.<n>TUmix achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods.<n>We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs.
- Score: 60.945393748584316
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
Related papers
- AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning [66.24374176797075]
We introduce textbfAdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior.<n>AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that prioritizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage.
arXiv Detail & Related papers (2026-01-26T16:04:43Z) - ML-Tool-Bench: Tool-Augmented Planning for ML Tasks [23.54937738755734]
We introduce a benchmark for evaluating tool-augmented machine learning agents.<n>Our benchmark goes beyond traditional tool-use evaluation by incorporating an in-memory named object management.<n>Our approach improves over ReAct by 16.52 percentile positions, taking the median across all Kaggle challenges.
arXiv Detail & Related papers (2025-11-29T23:59:40Z) - One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences.<n>We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios.<n>To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
arXiv Detail & Related papers (2025-10-30T06:08:27Z) - Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use [50.02614257515131]
Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning.<n>We propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use.
arXiv Detail & Related papers (2025-09-16T09:22:21Z) - Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning [69.32855772335624]
Multimodal agents, which integrate a controller e.g., a vision language model, with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks.<n>Existing approaches for training these agents depend on extensive human-annotated task-answer pairs and tool trajectories.<n>We propose an iterative tool usage exploration method for multimodal agents without any pre-collected data, namely SPORT.<n>SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning.
arXiv Detail & Related papers (2025-04-30T12:01:27Z) - Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks.<n>Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use.<n>We propose a framework that encourages models to produce accurate answers with minimal tool calls.<n>Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z) - Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [76.10639521319382]
We propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework.<n>We show Symbolic-MoE beats strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute avg. gain of 8.15% over the best multi-agent baseline.
arXiv Detail & Related papers (2025-03-07T18:03:13Z) - Multi-tool Integration Application for Math Reasoning Using Large Language Model [1.4582633500696451]
This article proposes a novel multi tool application framework for mathematical reasoning.
It aims to achieve more comprehensive and accurate mathematical reasoning by utilizing the collaborative effect of large language models (LLMs) and multiple external tools.
arXiv Detail & Related papers (2024-08-22T06:27:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.