Fugu-MT paper translation (abstract): Advancing Creative Physical Intelligence in Large Multimodal Models

Paper overview: Advancing Creative Physical Intelligence in Large Multimodal Models

arxiv url: http://arxiv.org/abs/2605.26396v1
Date: Mon, 25 May 2026 23:59:02 GMT
Status: translation complete
Last updated in system: 2026-05-27 17:51:41.507443
Title: Advancing Creative Physical Intelligence in Large Multimodal Models
Title (reference translation): Advancing Creative Physical Intelligence in Large Multimodal Models
Authors: Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim, Emre Can Acikgoz, Bingxuan Li, Kunlun Zhu, Jiateng Liu, Aditi Tiwari, Zhenhailong Wang, Xiusi Chen, Mahdi Namazifar, Heng Ji
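Abstract summary: MM-CreativityBench is a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. The authors' experiments show that current LMMs often fall short, not due to a lack of generative capability, but because they do not sustain grounded exploration. Motivated by this failure mode, the authors propose affordance-grounded alignment, which casts creative tool use as a preference learning problem.
Reference score (independently computed attention score): 62.56522271769017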
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.
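Abstract (reference translation): Large multimodal models (LMMs) have advanced rapidly in perception and reasoning, but it remains unclear whether these capabilities generalize beyond pattern recognition to discovering visually grounded solutions in open-ended environments. In such settings, intelligence requires more than answering well-posed questions. This kind of creative problem-solving is central to human intelligence, yet it remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to a lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.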
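
The abstract names Direct Preference Optimization as the training objective but gives no formula. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair, where the "chosen" response would be visually grounded reasoning and the "rejected" response a hallucinated alternative. The function names, the `beta` default, and the example log-probabilities are assumptions made for this sketch; it is not the authors' implementation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model. beta (assumed
    default 0.1) scales how strongly the policy may move away from the
    reference.
    """
    # Implicit reward margins: how much the policy up-weights each
    # response relative to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


# Hypothetical log-probabilities, chosen only to show the direction of the loss.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy == reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)    # policy prefers chosen
print(baseline, improved)
```

When the policy matches the reference, the loss equals log 2; it decreases as the policy assigns relatively more probability to the chosen response than to the rejected one.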

Related papers