Fugu-MT paper translation (abstract): Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents

Paper overview: Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents

arxiv url: http://arxiv.org/abs/2510.02204v1
Date: Thu, 02 Oct 2025 16:51:19 GMT
Status: translation complete
System update date: 2025-10-03 16:59:21.227285
Title: Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents
Authors: Lingzhong Dong, Ziqi Zhou, Shuaibo Yang, Haiyue Sheng, Pengzhou Cheng, Zongru Wu, Zheng Wu, Gongshen Liu, Zhuosheng Zhang
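Abstract summary: Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve execution accuracy. Existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions.
Reference score (independently computed attention score): 24.363473366637376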
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on the mobile graphical user interface. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or a crisis of trust. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but the reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
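
The joint GTA/EM view described in the abstract can be illustrated with a minimal sketch. It assumes each step has already been judged into two booleans: `gta` (the action implied by the CoT matches the ground truth) and `em` (the executed action exactly matches the ground truth). The names `Step`, `classify`, and `gap_rates` are hypothetical, and reporting EG and RG as fractions of all steps is an assumption, since the abstract does not state how the paper normalizes them or how the CoT-implied action is extracted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One evaluated agent step (hypothetical container, not from the paper)."""
    gta: bool  # action implied by the CoT matches the ground-truth action
    em: bool   # executed action exactly matches the ground-truth action


def classify(step: Step) -> str:
    """Map a (GTA, EM) pair to one of the four joint outcomes."""
    if step.gta and step.em:
        return "aligned"          # reasoning and execution both correct
    if step.gta:
        return "execution_gap"    # EG: reasoning correct, execution fails
    if step.em:
        return "reasoning_gap"    # RG: execution correct, reasoning conflicts
    return "both_wrong"


def gap_rates(steps: list[Step]) -> dict[str, float]:
    """Return GTA, EM, EG, and RG as fractions of all steps (assumed normalization)."""
    if not steps:
        raise ValueError("steps must be non-empty")
    counts = Counter(classify(s) for s in steps)
    n = len(steps)
    return {
        "GTA": sum(s.gta for s in steps) / n,
        "EM": sum(s.em for s in steps) / n,
        "EG": counts["execution_gap"] / n,
        "RG": counts["reasoning_gap"] / n,
    }


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    toy = [Step(True, True), Step(True, False), Step(True, False), Step(False, True)]
    print(gap_rates(toy))
```

Under this normalization, GTA minus EM equals EG minus RG, so a model can score similarly on both metrics while still showing sizable gaps of each kind. That is why the sketch counts the two gap types separately instead of comparing the aggregate scores.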

Related papers