Fugu-MT Paper Translation (Summary): Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

Paper summary: Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

arxiv url: http://arxiv.org/abs/2510.12712v1
Date: Tue, 14 Oct 2025 16:50:49 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.401024
Title: Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
Title (reference translation): Beyond Seeing: Evaluating Multimodal LLMs on Tool-Based Image Perception, Transformation, and Reasoning
Authors: Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Ernesto Gabriel Hernández Montoya, Chen Bo Calvin Zhang, Bin Hu, Yunzhong He, Bing Liu, Rakshith Sharma Srinivasa
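Abstract summary: Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are imperfect. We introduce IRIS (Interactive Reasoning with Images and Systems), which evaluates MLLMs' ability to perceive, transform, and reason over complex visual-textual tasks. Our evaluation shows that current MLLMs struggle with tasks that require effective integration of vision and general-purpose tools.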
Reference score (independently computed attention score): 16.686834520228132
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce IRIS, an Interactive Reasoning with Images and Systems that evaluates MLLMs' ability to perceive, transform, and reason across complex visual-textual tasks under the think with images paradigm. IRIS comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning across five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, IRIS offers critical insights for advancing visual intelligence in MLLMs.
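Abstract (reference translation): Multimodal Large Language Models (MLLMs) are increasingly used in real-world scenarios where user-provided images are often imperfect and need active manipulation, such as cropping, editing, or enhancement, to reveal salient visual cues. Beyond static visual perception, MLLMs must also think with images, dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, the shift from treating vision as passive context to treating it as a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow the think-about-images paradigm, in which images are regarded as static inputs. To address this gap, we introduce IRIS (Interactive Reasoning with Images and Systems), which evaluates MLLMs' ability to perceive, transform, and reason over complex visual-textual tasks under the think-with-images paradigm. IRIS consists of 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning five diverse domains, each paired with a detailed rubric to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks that require effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches a pass rate of only 18.68%. Tool-use behaviors also diverge: OpenAI models benefit from diverse image manipulations, while Gemini-2.5-pro shows no improvement. As the first benchmark centered on thinking with images, IRIS offers important insights for advancing visual intelligence in MLLMs.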

Related papers