Fugu-MT Paper Translation (Abstract): WebSight: A Vision-First Architecture for Robust Web Agents

Paper overview: WebSight: A Vision-First Architecture for Robust Web Agents

arxiv url: http://arxiv.org/abs/2508.16987v1
Date: Sat, 23 Aug 2025 11:02:59 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.284978
Title: WebSight: A Vision-First Architecture for Robust Web Agents
Title (reference translation): WebSight: A Vision-First Architecture for Robust Web Agents
Authors: Tanvir Bhathal, Asanshay Gupta
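Abstract summary: WebSight is a vision-based web agent designed to interact with web environments purely through visual perception. The paper introduces WebSight-7B, a vision-language model optimized for UI element interaction. WebSight-7B achieves a top-1 accuracy of 58.84% on the Showdown Clicks benchmark, outperforming larger generalist models. WebSight and WebSight-7B establish a new standard for interpretable, robust, and efficient visual web navigation.
Reference score (attention score computed independently by this site): 0.0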
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce WebSight, a vision-based autonomous web agent, designed to interact with web environments purely through visual perception, eliminating dependence on HTML or DOM-based inputs. Central to our approach we introduce our new model, WebSight-7B, a fine-tuned vision-language model optimized for UI element interaction, trained using LoRA on a web-focused subset of the Wave-UI-25K dataset. WebSight integrates this model into a modular multi-agent architecture, comprising planning, reasoning, vision-action, and verification agents, coordinated through an episodic memory mechanism. WebSight-7B achieves a top-1 accuracy of 58.84% on the Showdown Clicks benchmark, outperforming several larger generalist models while maintaining lower latency. The full WebSight agent achieves a 68.0% success rate on the WebVoyager benchmark, surpassing systems from labs such as OpenAI (61.0%) and HCompany (Runner H, 67.0%). Among tasks completed, WebSight answers correctly 97.14% of the time, indicating high precision. Together, WebSight and WebSight-7B establish a new standard for interpretable, robust, and efficient visual web navigation.
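Abstract (reference translation): We introduce WebSight, a vision-based autonomous web agent that interacts with web environments purely through visual perception, eliminating dependence on HTML or DOM-based inputs. At the center of the approach is WebSight-7B, a fine-tuned vision-language model optimized for UI element interaction, trained with LoRA on a web-focused subset of the Wave-UI-25K dataset. WebSight integrates this model into a modular multi-agent architecture in which planning, reasoning, vision-action, and verification agents are coordinated through an episodic memory mechanism. WebSight-7B achieves a top-1 accuracy of 58.84% on the Showdown Clicks benchmark. The full WebSight agent achieves a 68.0% success rate on the WebVoyager benchmark, surpassing systems from labs such as OpenAI (61.0%) and HCompany (Runner H, 67.0%). Of the tasks it completes, WebSight answers 97.14% correctly, indicating high precision. Together, WebSight and WebSight-7B establish a new standard for interpretable, robust, and efficient visual web navigation.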
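The abstract reports two related figures for the full agent: a 68.0% success rate over all WebVoyager tasks and 97.14% correctness among completed tasks. The sketch below shows how the two could relate, under the assumption (not stated in the abstract) that the success rate equals the completion rate multiplied by the correctness among completed tasks. The function name `completion_rate` is hypothetical and used only for illustration.

```python
def completion_rate(success_rate: float, precision: float) -> float:
    """Estimate the fraction of tasks the agent completed.

    Assumption (ours, not the paper's): success_rate = completion_rate * precision,
    where precision is the share of completed tasks answered correctly.
    """
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    if not 0 <= success_rate <= precision:
        raise ValueError("success_rate must be in [0, precision]")
    return success_rate / precision


if __name__ == "__main__":
    # Figures quoted in the abstract: 68.0% success, 97.14% precision.
    rate = completion_rate(0.680, 0.9714)
    print(f"Implied completion rate: {rate:.1%}")  # prints "Implied completion rate: 70.0%"
```

Under that assumption, the agent would have completed roughly 70% of the tasks, and nearly all of its failures would come from tasks it did not finish rather than from wrong answers. The paper itself should be consulted for the actual task counts.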

Related papers