Fugu-MT Paper Translation (Abstract): Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

Paper Abstract: Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

arxiv url: http://arxiv.org/abs/2605.07630v1
Date: Fri, 08 May 2026 11:58:57 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:39.024587
Title: Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
Title (reference translation): Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
Authors: Zhengyang Tang, Yi Zhang, Chenxin Li, Xin Lai, Pengyuan Lyu, Yiduo Guo, Weinong Wang, Junyi Li, Yang Ding, Huawen Shen, Zhengyao Fang, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Chengquan Zhang, Han Hu
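Abstract summary: When a phone-use agent avoids harm, does that show safety, or simply inability to act? A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps.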
Reference score (independently computed attention score): 73.69976712292681
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
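Abstract (reference translation): When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. The results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act on more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.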
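The three-way question in the abstract (safe action, unsafe action, or nothing useful) can be illustrated with a small tally. This is a minimal sketch, not the PhoneSafety scoring code: the `Outcome` labels, the `summarize` helper, and the two toy models are hypothetical names and invented data, used only to show why a shared "harmless" rate can hide different behavior.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the next decision at a risky moment."""
    SAFE = "safe"                  # recognized the risk, took the safe action
    UNSAFE = "unsafe"              # took the harmful action
    NO_USEFUL_ACTION = "no_action" # failed to do anything useful


def summarize(outcomes):
    """Return per-outcome rates plus the merged 'harmless' rate."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    safe = counts[Outcome.SAFE]
    unsafe = counts[Outcome.UNSAFE]
    no_action = counts[Outcome.NO_USEFUL_ACTION]
    return {
        "safe_rate": safe / total,
        "unsafe_rate": unsafe / total,
        "no_useful_action_rate": no_action / total,
        # Harm is avoided both by safe choices and by failing to act,
        # so this merged number cannot distinguish the two.
        "harmless_rate": (safe + no_action) / total,
    }


# Invented toy data (not results from the paper): two models, ten instances each.
model_a = [Outcome.SAFE] * 8 + [Outcome.UNSAFE] * 2
model_b = (
    [Outcome.SAFE] * 3
    + [Outcome.NO_USEFUL_ACTION] * 5
    + [Outcome.UNSAFE] * 2
)

summary_a = summarize(model_a)
summary_b = summarize(model_b)
print(summary_a)
print(summary_b)
```

In this toy case both models avoid harm on the same share of instances, yet only one of them does so mostly by choosing the safe action. Reporting the safe, unsafe, and no-useful-action rates separately keeps that difference visible.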

Related Papers