MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
- URL: http://arxiv.org/abs/2507.21924v1
- Date: Tue, 29 Jul 2025 15:39:14 GMT
- Title: MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
- Authors: Tianhong Gao, Yannian Fu, Weiqun Wu, Haixiao Yue, Shanshan Liu, Gang Zhang, et al.
- Abstract summary: MMAT-1M is the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine. By fine-tuning open-source multimodal models on MMAT-1M, we observe significant performance gains.
- Score: 4.963955559863751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format. By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset's effectiveness in enhancing multimodal reasoning and tool-based capabilities. The dataset is publicly available at https://github.com/VIS-MPU-Agent/MMAT-1M.
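To make the four-stage data engine concrete, here is a minimal Python sketch of how such a pipeline could be wired together, based only on the abstract's description. Every name in it (the Sample fields, call_gpt4o, needs_tool, run_tools, and the prompts) is a hypothetical placeholder, not the authors' released code or data format.

```python
# Hypothetical sketch of the four-stage MMAT-1M data engine described in the
# abstract: (1) curated QA pairs, (2) multi-turn rationale generation with
# API/RAG tool calls, (3) reflection for consistency, (4) optional one-turn
# Rationale-and-Reflection (ORR) compression. All helpers are placeholders.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    image_path: str
    question: str
    answer: str
    dialogue: List[Dict[str, str]] = field(default_factory=list)  # multi-turn RR
    orr: str = ""                                                  # one-turn RR


def call_gpt4o(messages: List[Dict[str, str]]) -> str:
    """Placeholder for a GPT-4o chat call via whatever API client is used."""
    raise NotImplementedError


def needs_tool(rationale: str) -> bool:
    """Hypothetical check: does the draft rationale request external information?"""
    return "<search>" in rationale or "<rag>" in rationale


def run_tools(rationale: str, sample: Sample) -> str:
    """Placeholder for API calls / RAG retrieval injected back into the dialogue."""
    return "tool results for: " + sample.question


def build_rr(sample: Sample) -> Sample:
    """Stages 2-3: multi-turn rationale generation with tools, then reflection."""
    messages = [{"role": "user",
                 "content": f"Question: {sample.question}\nAnswer: {sample.answer}\n"
                            "Explain the reasoning step by step."}]
    rationale = call_gpt4o(messages)

    # Stage 2: dynamically integrate API calls / RAG in a multi-turn paradigm.
    while needs_tool(rationale):
        tool_output = run_tools(rationale, sample)
        messages += [{"role": "assistant", "content": rationale},
                     {"role": "user", "content": tool_output}]
        rationale = call_gpt4o(messages)

    # Stage 3: reflection pass to enforce logical consistency and accuracy.
    messages += [{"role": "assistant", "content": rationale},
                 {"role": "user",
                  "content": "Reflect on the rationale above and fix any "
                             f"inconsistency with the answer '{sample.answer}'."}]
    reflection = call_gpt4o(messages)

    sample.dialogue = messages + [{"role": "assistant", "content": reflection}]
    return sample


def compress_to_orr(sample: Sample) -> Sample:
    """Stage 4 (optional): fold the multi-turn dialogue into a single-turn
    Rationale-and-Reflection (ORR) target for more efficient fine-tuning."""
    turns = "\n".join(f"[{m['role']}] {m['content']}" for m in sample.dialogue)
    sample.orr = call_gpt4o(
        [{"role": "user",
          "content": "Rewrite this dialogue as one concise rationale followed "
                     "by a reflection and the final answer:\n" + turns}])
    return sample
```

In this reading, the RR variant fine-tunes on the full multi-turn dialogue, while the ORR variant trains on the compressed single-turn target; the actual prompts, tool interfaces, and data schema are defined by the authors' released dataset.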
Related papers
- CIMR: Contextualized Iterative Multimodal Reasoning for Robust Instruction Following in LVLMs [2.238122883754112]
CIMR is a novel framework that introduces a context-aware iterative reasoning and self-correction module. CIMR achieves 91.5% accuracy, outperforming state-of-the-art models such as GPT-4V, LLaVA-1.5, MiniGPT-4, and InstructBLIP.
arXiv Detail & Related papers (2025-07-22T18:39:18Z)
- MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization [74.04867639197445]
MiroMind-M1 is a set of fully open-source RLMs built on the Qwen-2.5 backbone. Our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems.
arXiv Detail & Related papers (2025-07-19T16:21:23Z)
- WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning [17.459985667824807]
Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. In this paper, we show how to achieve general-purpose visual-language reasoning through reinforcement learning.
arXiv Detail & Related papers (2025-06-09T16:20:54Z)
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs). Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
- Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset. We explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)