MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
- URL: http://arxiv.org/abs/2510.08567v3
- Date: Tue, 21 Oct 2025 10:22:02 GMT
- Title: MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
- Authors: Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan
- Abstract summary: We develop a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories. We also introduce Pref-X, a set of 11K automatically generated preference pairs. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs.
- Score: 65.200259961515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.
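The abstract does not spell out the step-wise preference objective; a minimal sketch of one common form such learning can take (a DPO-style loss applied per step rather than per trajectory) might look like the following. The function name, the pair layout, and the use of a fixed reference model are assumptions for illustration, not the paper's stated method.

```python
import math

def stepwise_dpo_loss(pairs, beta=0.1):
    """Average DPO-style loss over per-step preference pairs (illustrative).

    Each pair is (logp_chosen, logp_rejected, ref_logp_chosen,
    ref_logp_rejected): the policy's and a frozen reference model's
    log-probabilities of the preferred and dispreferred step, given the
    same preceding context.
    """
    total = 0.0
    for lp_c, lp_r, ref_c, ref_r in pairs:
        # Implicit reward margin of the chosen step over the rejected one,
        # measured relative to the reference model.
        margin = beta * ((lp_c - lp_r) - (ref_c - ref_r))
        total += -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return total / len(pairs)
```

Scoring each step independently lets verified good and bad tool calls inside the same trajectory both contribute training signal, which is the motivation the abstract gives for step-wise rather than trajectory-level alignment.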
Related papers
- MTDrive: Multi-turn Interactive Reinforcement Learning for Autonomous Driving [12.330414519761524]
Trajectory planning is a core task in autonomous driving. MLLMs with Reinforcement Learning have shown promise in addressing "long-tail" scenarios. We present MTDrive, a multi-turn framework that enables MLLMs to iteratively refine trajectories based on environmental feedback.
arXiv Detail & Related papers (2026-01-30T12:47:55Z)
- Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning [26.35834992466776]
ATLAS is a dual-path framework for dynamic tool usage in cross-domain complex reasoning. Our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.
arXiv Detail & Related papers (2026-01-07T12:38:33Z)
- AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning [79.65732142949014]
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories.
arXiv Detail & Related papers (2025-12-15T12:38:04Z)
- TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments [30.078263383249862]
Toucan is the largest publicly available tool-agentic dataset to date. It provides diverse, realistic, and challenging tasks with trajectories involving real tool execution.
arXiv Detail & Related papers (2025-10-01T17:58:03Z)
- Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [63.31585771716123]
Large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). We introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training.
arXiv Detail & Related papers (2025-05-22T09:00:19Z)
- Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning [69.32855772335624]
Multimodal agents, which integrate a controller (e.g., a vision language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents depend on extensive human-annotated task-answer pairs and tool trajectories. We propose SPORT, an iterative tool usage exploration method for multimodal agents that requires no pre-collected data. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning.
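The four SPORT components above form a loop; one round of it might be sketched as follows. The function names and the best-vs-worst pairing rule are illustrative placeholders, not the paper's actual API, and the component callables stand in for whatever models the method uses.

```python
# Hypothetical sketch of one SPORT round: task synthesis, step sampling,
# step verification, preference tuning. All names are illustrative.

def sport_iteration(synthesize_task, sample_steps, verify_step,
                    tune_on_preferences, n_tasks=4):
    """Collect step-wise preference pairs without pre-collected data, then tune."""
    preference_pairs = []
    for _ in range(n_tasks):
        task = synthesize_task()                                   # 1. task synthesis
        candidates = sample_steps(task)                            # 2. step sampling
        scored = [(verify_step(task, c), c) for c in candidates]   # 3. step verification
        scored.sort(key=lambda s: s[0], reverse=True)
        if len(scored) >= 2 and scored[0][0] > scored[-1][0]:
            # Best-verified candidate step is "chosen", worst is "rejected".
            preference_pairs.append((task, scored[0][1], scored[-1][1]))
    tune_on_preferences(preference_pairs)                          # 4. preference tuning
    return preference_pairs
```

Because the verifier ranks sampled candidate steps against each other, the loop produces preference pairs from the agent's own rollouts, which is what lets the method avoid human-annotated trajectories.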
arXiv Detail & Related papers (2025-04-30T12:01:27Z)
- Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage [75.76940471949366]
We propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data. To preserve the data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories. Evaluations show that the T3-Agent consistently achieves improvements on two popular VLMs.
arXiv Detail & Related papers (2024-12-20T07:00:46Z)
- Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation [51.20656279478878]
MATRIX is a multi-agent simulator that automatically generates diverse text-based scenarios. We introduce MATRIX-Gen for controllable and highly realistic data synthesis. On AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model.
arXiv Detail & Related papers (2024-10-18T08:01:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.