Fugu-MT Paper Translation (Abstract): OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

Paper abstract: OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

arxiv url: http://arxiv.org/abs/2511.20766v1
Date: Tue, 25 Nov 2025 19:00:22 GMT
Status: Translation complete
System update date: 2025-11-27 18:37:58.814153
Title: OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability
Title (reference translation): OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability
Authors: Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim
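Abstract summary: Reliability is key to realizing the promise of autonomous UI agents. We develop OpenApps, a lightweight open-source ecosystem with six apps. We run more than 10,000 independent evaluations to study the reliability of seven leading multimodal agents.
Reference score (independently computed attention score): 49.99934595922838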
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reliability is key to realizing the promise of autonomous UI-Agents, multimodal agents that directly interact with apps in the same manner as humans, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which are limited in that they can only shed light on whether or how often an agent can complete a task within a specific environment. When deployed however, agents are likely to encounter variations in app design and content that can affect an agent's ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a light-weight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. Specifically, we run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations. Task success rates for many agents can fluctuate by more than $50\%$ across app variations. For example, Kimi-VL-3B's average success across all tasks fluctuates from $63\%$ to just $4\%$ across app versions. We also find agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations. OpenApps is available at https://facebookresearch.github.io/OpenApps/
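Abstract (reference translation): Reliability is key to realizing the promise of autonomous UI agents, that is, multimodal agents that interact directly with apps in the same way humans do, because users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, and are therefore limited to showing whether or how often an agent can complete a task within one specific environment. Once deployed, however, agents are likely to encounter variations in app design and content that can affect their ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we developed OpenApps, a lightweight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps runs on a single CPU, making it easy to generate and deploy thousands of versions of each app. Specifically, we run more than 10,000 independent evaluations across seven leading multimodal agents to study reliability. Standard reliability within a fixed app is relatively stable, but reliability can vary drastically when measured across app variations. Task success rates for many agents fluctuate by more than 50% across app variations. For example, Kimi-VL-3B's average success across all tasks fluctuates from 63% to just 4% across app versions. Agent behaviors such as looping or hallucinating actions can also differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations. OpenApps is available at https://facebookresearch.github.io/OpenApps/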
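
The fluctuation the abstract reports (success rates that differ by more than 50% between app variations) can be illustrated with a minimal Python sketch. This is not OpenApps code or its API: the function names, the variation labels, and the toy run data are all hypothetical, and the spread is simply the best per-variation success rate minus the worst.

```python
from collections import defaultdict


def success_rate_by_variation(results):
    """Return {variation_id: success_rate} from (variation_id, succeeded) pairs.

    Hypothetical helper for illustration; not part of OpenApps.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for variation, succeeded in results:
        totals[variation] += 1
        wins[variation] += int(succeeded)
    return {v: wins[v] / totals[v] for v in totals}


def fluctuation(rates):
    """Spread between the best and worst per-variation success rates."""
    values = list(rates.values())
    return max(values) - min(values)


# Made-up toy data: one agent, one task, two hypothetical app variations.
runs = [
    ("default", True), ("default", True), ("default", False), ("default", True),
    ("dark_theme", True), ("dark_theme", False), ("dark_theme", False), ("dark_theme", False),
]

rates = success_rate_by_variation(runs)
spread = fluctuation(rates)
print(rates, spread)
```

Aggregating over all runs in this toy data would report a single 50% success rate and hide the gap between the two variations, which is the blind spot the abstract describes.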
