Fugu-MT Paper Translation (Abstract): ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Paper overview: ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

arxiv url: http://arxiv.org/abs/2605.11212v2
Date: Wed, 13 May 2026 16:34:05 GMT
Status: Translation complete
Last updated in system: 2026-05-14 17:13:58.876169
Title: ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
Title (reference translation): ReVision: Scaling Computer-Use Agents by Reducing Temporal Visual Redundancy
Authors: Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet,
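Abstract summary: ReVision is used to train multimodal language models on trajectories from which redundant visual patches have been removed. It reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but is rather a consequence of inefficient token representations.
Reference score (independently computed attention score): 46.69118032596015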
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. This has resulted in no or very limited improvement in the performance when using history unlike other domains. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but rather a consequence of inefficient token representations.
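Abstract (reference translation): Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, and each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow longer, token cost rises rapidly, which limits how much history can be included under a fixed context and compute budget. As a result, and unlike in other domains, using history has yielded little or no performance improvement. The paper addresses this inefficiency with ReVision, which trains multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and removes the redundant ones while preserving the spatial structure the model requires. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain and lets agents process longer trajectories with fewer tokens. With this improved efficiency, the authors revisit the role of history in CUAs and find that, once redundancy is removed, performance continues to improve as more past observations are incorporated. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but is rather a consequence of inefficient token representations.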
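
To make the idea of temporal patch redundancy concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract describes a learned patch selector whose details are not given here, so this sketch substitutes a fixed distance threshold as a stand-in. The names `Patch`, `Frame`, `patch_distance`, `keep_mask`, and `token_reduction`, the threshold value, and the toy data are all hypothetical.

```python
from typing import List, Sequence

Patch = Sequence[float]    # hypothetical per-patch feature vector
Frame = List[List[Patch]]  # grid of patches, indexed [row][col]


def patch_distance(a: Patch, b: Patch) -> float:
    """Mean absolute difference between two patch vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def keep_mask(prev: Frame, curr: Frame, threshold: float) -> List[List[bool]]:
    """Mark patches of `curr` that differ from the same position in `prev`.

    ASSUMPTION: a fixed threshold replaces the learned selector. The mask
    has the same grid shape as the frame, so every kept patch retains its
    (row, col) position.
    """
    return [
        [patch_distance(p, c) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def token_reduction(frames: List[Frame], threshold: float) -> float:
    """Fraction of patches dropped across a trajectory.

    The first frame is kept whole; each later frame is compared with
    its immediate predecessor.
    """
    total = sum(len(row) for frame in frames for row in frame)
    kept = sum(len(row) for row in frames[0])
    for prev, curr in zip(frames, frames[1:]):
        kept += sum(flag for row in keep_mask(prev, curr, threshold) for flag in row)
    return 1.0 - kept / total


# Toy trajectory: three 2x2 frames in which one patch changes per step.
f0 = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 2.0), (3.0, 3.0)]]
f1 = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 2.0), (9.0, 9.0)]]
f2 = [[(5.0, 5.0), (1.0, 1.0)], [(2.0, 2.0), (9.0, 9.0)]]

mask = keep_mask(f0, f1, threshold=0.5)
reduction = token_reduction([f0, f1, f2], threshold=0.5)
```

In this toy trajectory only the changed patch in each later frame is kept, so 6 of 12 patches survive. The figure is a property of the toy data and says nothing about the results reported in the abstract.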

Related papers