Related papers: Visual Agentic Reinforcement Fine-Tuning

URL: http://arxiv.org/abs/2505.14246v1
Date: Tue, 20 May 2025 11:59:25 GMT
Title: Visual Agentic Reinforcement Fine-Tuning
Authors: Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
Abstract summary: This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search.
Score: 73.37007472426299
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, are still less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves +29.3% F1 / +25.9% EM gains on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.
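Note on metrics: the abstract reports gains in F1 and EM (exact match) but does not say how they are computed. The sketch below shows the token-level F1 and exact-match scoring commonly used for extractive QA benchmarks such as HotpotQA. Whether MAT uses exactly this normalization is an assumption, and the function names (normalize, exact_match, token_f1) are hypothetical, chosen only for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    This normalization is an assumption modeled on common QA scoring scripts.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this scoring, a prediction that contains the reference plus extra words gets EM of 0 but a partial F1, which is why the two numbers are usually reported together.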

Related papers

SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning [57.083359974905655]
SenseNova-MARS is a novel Multimodal Agentic Reasoning and Search framework. It dynamically integrates the image search, text search, and image crop tools to tackle knowledge-intensive visual understanding challenges. SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks.
arXiv Detail & Related papers (2025-12-30T16:31:45Z)
Training Multi-Image Vision Agents via End2End Reinforcement Learning [51.81337984526068]
We propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning. By leveraging a multi-agent system, we generate challenging and visually-rich multi-image QA pairs. We develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content.
arXiv Detail & Related papers (2025-12-05T10:02:38Z)
Thinking with Programming Vision: Towards a Unified View for Thinking with Images [23.596757163808906]
We show that even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions. We propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation.
arXiv Detail & Related papers (2025-12-03T12:44:15Z)
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning [30.018325742295243]
OpenAI o3 can create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Visual Search tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks.
arXiv Detail & Related papers (2025-11-03T18:40:17Z)
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [57.89304342666846]
We introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. We propose a novel reinforcement learning framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies.
arXiv Detail & Related papers (2025-05-13T14:35:51Z)
Visual-RFT: Visual Reinforcement Fine-Tuning [75.20572976629646]
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers. Visual-RFT further extends the application areas of RFT on visual tasks.
arXiv Detail & Related papers (2025-03-03T18:16:32Z)
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
Raising the Bar of AI-generated Image Detection with CLIP [50.345365081177555]
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
arXiv Detail & Related papers (2023-11-30T21:11:20Z)
Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification [11.562980171753162]
Generalizable person re-identification (Re-ID) is a very hot research topic in machine learning and computer vision. Previous methods mainly focus on visual representation learning while neglecting to explore the potential of semantic features during training. We propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning.
arXiv Detail & Related papers (2023-04-19T08:37:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
