Fugu-MT Paper Translation (Summary): Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

Paper summary: Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

arxiv url: http://arxiv.org/abs/2604.17817v1
Date: Mon, 20 Apr 2026 05:15:14 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.703857
Title: Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
Title (reference translation): Do LLMs need to see everything? A benchmark and study of failures in LLM-driven smartphone automation using screentext vs. screenshots
Authors: Shiquan Zhang, Tianyi Zhang, Le Fang, Simon D'Alfonso, Hong Jia, Vassilis Kostakos
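Abstract summary: DailyDroid is a benchmark of 75 tasks in five scenarios across 25 Android apps. Evaluated on GPT-4o and o4-mini with text-only and multimodal (text + screenshot) inputs across 300 trials, the two input types perform comparably, with multimodal inputs yielding marginally higher success rates.
Reference score (independently calculated attention score): 15.63408997133083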
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, with limited prior work examining why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it using text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, revealing comparable performance with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI accessibility, input modalities, and LLM/app design, offering implications for future mobile agents, applications, and UI development.
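Abstract (reference translation): With the rapid progress of large language models (LLMs), mobile agents have emerged as promising tools for phone automation that simulate human interaction on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, and little prior work has examined why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it on GPT-4o and o4-mini with text-only and multimodal (text + screenshot) inputs across 300 trials, and find comparable performance, with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI accessibility, input modalities, and LLM/app design, with implications for future mobile agents, applications, and UI development.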

Related papers