WebSight: A Vision-First Architecture for Robust Web Agents
- URL: http://arxiv.org/abs/2508.16987v1
- Date: Sat, 23 Aug 2025 11:02:59 GMT
- Title: WebSight: A Vision-First Architecture for Robust Web Agents
- Authors: Tanvir Bhathal, Asanshay Gupta
- Abstract summary: WebSight is a vision-based web agent designed to interact with web environments purely through visual perception. We introduce WebSight-7B, a fine-tuned vision-language model optimized for UI element interaction. WebSight-7B achieves a top-1 accuracy of 58.84% on the Showdown Clicks benchmark, outperforming several larger generalist models. WebSight and WebSight-7B establish a new standard for interpretable, robust, and efficient visual web navigation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce WebSight, a vision-based autonomous web agent designed to interact with web environments purely through visual perception, eliminating dependence on HTML or DOM-based inputs. Central to our approach is our new model, WebSight-7B, a fine-tuned vision-language model optimized for UI element interaction, trained using LoRA on a web-focused subset of the Wave-UI-25K dataset. WebSight integrates this model into a modular multi-agent architecture comprising planning, reasoning, vision-action, and verification agents, coordinated through an episodic memory mechanism. WebSight-7B achieves a top-1 accuracy of 58.84% on the Showdown Clicks benchmark, outperforming several larger generalist models while maintaining lower latency. The full WebSight agent achieves a 68.0% success rate on the WebVoyager benchmark, surpassing systems from labs such as OpenAI (61.0%) and HCompany (Runner H, 67.0%). Among the tasks it completes, WebSight answers correctly 97.14% of the time, indicating high precision. Together, WebSight and WebSight-7B establish a new standard for interpretable, robust, and efficient visual web navigation.
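The abstract describes a multi-agent loop (planning, vision-action, and verification agents coordinated through an episodic memory). The paper does not publish this interface, so the following is only a minimal illustrative sketch of how such coordination could be structured; all class and function names are hypothetical, and the agent steps are stubs.

```python
# Hypothetical sketch of a WebSight-style multi-agent loop.
# Names and interfaces are illustrative, not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    """Shared log of past steps that every agent can consult."""
    steps: list = field(default_factory=list)

    def record(self, role: str, content: str) -> None:
        self.steps.append((role, content))

    def recent(self, k: int = 5) -> list:
        return self.steps[-k:]

def run_episode(task: str, memory: EpisodicMemory, max_steps: int = 3) -> bool:
    """Planner -> vision-action -> verifier loop, coordinated via memory."""
    for step in range(max_steps):
        plan = f"step {step}: locate target for '{task}'"   # planning agent (stub)
        memory.record("planner", plan)
        action = f"click(elem_{step})"                      # vision-action agent (stub)
        memory.record("actor", action)
        verified = step == max_steps - 1                    # verification agent (stub)
        memory.record("verifier", str(verified))
        if verified:
            return True
    return False

memory = EpisodicMemory()
done = run_episode("search for WebSight paper", memory)
```

In a real system each stub would call a model (e.g. the fine-tuned vision-language model for the vision-action step), and the shared memory is what lets the verifier check an action against the plan that produced it.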
Related papers
- WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks [35.99528846296261]
WebGym is the largest-to-date open-source environment for training realistic visual web agents. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites.
arXiv Detail & Related papers (2026-01-05T09:35:11Z)
- IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video? [56.33950760097989]
IWR-Bench is a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions. This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code.
arXiv Detail & Related papers (2025-09-29T12:38:06Z)
- WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents [57.203515352080295]
We introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. As an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training.
arXiv Detail & Related papers (2025-09-08T10:07:03Z)
- Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence [109.32705135051486]
Embodied Web Agents is a novel paradigm for AI agents that fluidly bridges embodiment and web-scale reasoning. We release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks. Results reveal significant performance gaps between state-of-the-art AI systems and human capabilities.
arXiv Detail & Related papers (2025-06-18T17:58:17Z)
- WebGames: Challenging General-Purpose Web-Browsing AI Agents [11.320069795732058]
WebGames is a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents. We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance. Results reveal a substantial capability gap, with the best AI system achieving only a 43.1% success rate compared to human performance of 95.7%.
arXiv Detail & Related papers (2025-02-25T16:45:08Z)
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space. AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.