Fugu-MT Paper Translation (Abstract): Visual Agentic Reinforcement Fine-Tuning

Paper overview: Visual Agentic Reinforcement Fine-Tuning

arxiv url: http://arxiv.org/abs/2505.14246v1
Date: Tue, 20 May 2025 11:59:25 GMT
Status: Translation complete
Last updated in system: 2025-05-21 14:49:53.154288
Title: Visual Agentic Reinforcement Fine-Tuning
Title (reference translation): Visual Agentic Reinforcement Fine-Tuning
Authors: Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
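Abstract summary: This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities in Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and to write code that manipulates and analyzes input images through cropping, rotation, and other image processing techniques. Experimental results show that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and by +10.3% F1 / +8.7% EM on MAT-Search.
Reference score (independently computed attention score): 73.37007472426299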
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools, such as web browsers for searching and code writing/execution for image manipulation, in order to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, remain less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves +29.3% F1 / +25.9% EM gains on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.
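Abstract (reference translation): A key trend in large reasoning models (e.g., OpenAI's o3) is the native agentic ability to use external tools, such as web browsers for search and code writing and execution for image manipulation, in order to think with images. In the open-source research community, language-only agentic abilities such as function calling and tool integration have advanced considerably, but the development of multimodal agentic capabilities that involve truly thinking with images, along with corresponding benchmarks, remains less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities in Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs can browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. The authors also propose the Multi-modal Agentic Tool Bench (MAT), with two settings (MAT-Search and MAT-Coding), to evaluate the agentic search and coding abilities of LVLMs. In the experiments, Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and by +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves gains of +29.3% F1 / +25.9% EM on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA. These results suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.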
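
Note on the metrics: the abstract reports gains in F1 and EM (exact match) but does not say how they are computed. The sketch below shows the token-level convention commonly used for QA benchmarks such as HotpotQA; it is an assumption for illustration, not the paper's scoring script. The function names `normalize`, `exact_match`, and `token_f1` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    This normalization is an assumed convention, not taken from the paper.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this convention, EM rewards only fully correct answers while F1 gives partial credit for overlapping tokens, which is one reason the two reported gains can differ in size. Benchmark-level scores are typically averages of these per-question values.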

Related papers