WebNav: An Intelligent Agent for Voice-Controlled Web Navigation
- URL: http://arxiv.org/abs/2503.13843v2
- Date: Sun, 06 Jul 2025 08:41:00 GMT
- Title: WebNav: An Intelligent Agent for Voice-Controlled Web Navigation
- Authors: Trisanth Srinivasan, Santosh Patapati
- Abstract summary: WebNav is a novel agent for multi-modal web navigation. The system combines vision-based context from screenshots with a dynamic DOM-labeling browser extension.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The current state of modern web interfaces, especially with regard to accessibility-focused usage, is severely lacking. Traditional methods for web interaction, such as scripting languages and screen readers, often lack the flexibility to handle dynamic content or the intelligence to interpret high-level user goals. To address these limitations, we introduce WebNav, a novel agent for multi-modal web navigation. WebNav leverages a dual Large Language Model (LLM) architecture to translate natural language commands into precise, executable actions on a graphical user interface. The system combines vision-based context from screenshots with a dynamic DOM-labeling browser extension to robustly identify interactive elements. A high-level 'Controller' LLM strategizes the next step toward a user's goal, while a second 'Assistant' LLM generates the exact parameters for execution. This separation of concerns allows for sophisticated task decomposition and action formulation. Our work presents the complete architecture and implementation of WebNav, demonstrating a promising approach to creating more intelligent web automation agents.
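The paper describes the architecture in prose only; below is a minimal, illustrative sketch of how the Controller/Assistant split could be wired together, with both LLM calls abstracted as plain callables. The function names, prompts, and action schema are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the dual-LLM Controller/Assistant loop described in the
# abstract. LLM calls are abstracted as callables; prompts, the action schema,
# and all names are illustrative assumptions rather than WebNav's actual code.
import json
from typing import Callable

LLM = Callable[[str], str]  # prompt in, completion out

def controller_step(controller: LLM, goal: str, screenshot_text: str, labeled_dom: str) -> str:
    """High-level 'Controller' LLM: decide the next sub-step toward the user's goal."""
    prompt = (
        f"User goal: {goal}\n"
        f"Visible page (screenshot summary): {screenshot_text}\n"
        f"Labeled interactive elements: {labeled_dom}\n"
        "What single step should be taken next? Answer in one sentence, or say 'done'."
    )
    return controller(prompt)

def assistant_step(assistant: LLM, step: str, labeled_dom: str) -> dict:
    """'Assistant' LLM: turn the chosen step into exact, executable action parameters."""
    prompt = (
        f"Step to execute: {step}\n"
        f"Labeled interactive elements: {labeled_dom}\n"
        'Respond with JSON: {"action": "click|type|scroll", "label": "...", "text": "..."}'
    )
    return json.loads(assistant(prompt))

def navigate(controller: LLM, assistant: LLM, goal: str, observe, execute, max_steps: int = 10):
    """Alternate Controller planning and Assistant parameter generation until done."""
    for _ in range(max_steps):
        screenshot_text, labeled_dom = observe()   # screenshot context + DOM labels from the extension
        step = controller_step(controller, goal, screenshot_text, labeled_dom)
        if "done" in step.lower():
            break
        action = assistant_step(assistant, step, labeled_dom)
        execute(action)                            # e.g. forwarded to the browser for execution
```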
Related papers
- R2D2: Remembering, Reflecting and Dynamic Decision Making for Web Agents [53.94879482534949]
Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents.
arXiv Detail & Related papers (2025-01-21T20:21:58Z) - PAFFA: Premeditated Actions For Fast Agents [23.363582411971567]
PAFFA is a framework designed to enhance web interaction capabilities through an Action API Library of reusable, verified browser interaction functions. It reduces inference calls by 87% while maintaining robust performance even as website structures evolve. This framework accelerates multi-page task execution and offers a scalable solution to advance autonomous web agent research.
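PAFFA's Action API Library is described only at a high level here; the following hypothetical sketch shows what a registry of reusable, pre-verified browser interaction functions might look like, so a plan can be executed without issuing a new inference call per low-level step. The `browser` object and all function names are placeholders, not the paper's API.

```python
# Hypothetical sketch of an "Action API Library": reusable, verified browser
# interaction functions registered by name so an agent can invoke them directly
# instead of re-deriving low-level actions with fresh LLM calls.
from typing import Callable, Dict

ACTION_LIBRARY: Dict[str, Callable[..., None]] = {}

def register(name: str):
    """Decorator that adds a verified interaction function to the library."""
    def wrap(fn: Callable[..., None]) -> Callable[..., None]:
        ACTION_LIBRARY[name] = fn
        return fn
    return wrap

@register("search_site")
def search_site(browser, query: str) -> None:
    # Composite, pre-verified action: fill the site's search box and submit.
    browser.fill("input[type=search]", query)
    browser.press("input[type=search]", "Enter")

@register("open_first_result")
def open_first_result(browser) -> None:
    browser.click("li.result:first-child a")

def run_plan(browser, plan) -> None:
    """Execute a plan of (action_name, kwargs) pairs without further inference."""
    for name, kwargs in plan:
        ACTION_LIBRARY[name](browser, **kwargs)
```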
arXiv Detail & Related papers (2024-12-10T22:51:31Z) - Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping [57.024913536420264]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on the design-to-code task. We present the first systematic investigation of MLLMs in generating interactive webpages.
arXiv Detail & Related papers (2024-11-05T17:40:03Z) - From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents [7.41862656697588]
This study aims to analyze the various contextual elements crucial to the functioning of web navigation agents.
We focus on the influence of interaction history and web page representation.
Our work highlights improved agent performance across out-of-distribution scenarios.
arXiv Detail & Related papers (2024-10-31T01:51:41Z) - Infogent: An Agent-Based Framework for Web Information Aggregation [59.67710556177564]
We introduce Infogent, a novel framework for web information aggregation.
Experiments on different information access settings demonstrate Infogent beats an existing SOTA multi-agent search framework by 7%.
arXiv Detail & Related papers (2024-10-24T18:01:28Z) - Leveraging Large Vision Language Model For Better Automatic Web GUI Testing [7.480576630392405]
This paper proposes VETL, the first LVLM-driven end-to-end web testing technique.
With LVLM's scene understanding capabilities, VETL can generate valid and meaningful text inputs focusing on the local context.
The selection of associated GUI elements is formulated as a visual question-answering problem, allowing LVLM to capture the logical connection between the input box and the relevant element.
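As a rough illustration of framing element selection as visual question answering, the sketch below asks an abstract vision-language model which labeled element belongs with a given input box. The prompt wording and labeling scheme are assumptions, not VETL's exact formulation.

```python
# Sketch of GUI-element selection posed as visual question answering. The
# vision-language model is abstracted as a callable; the prompt format and
# labeling scheme are illustrative assumptions.
from typing import Callable, List

VLM = Callable[[bytes, str], str]  # (screenshot image bytes, question) -> answer text

def select_element(vlm: VLM, screenshot: bytes, element_labels: List[str], input_box: str) -> str:
    """Ask the model which labeled element is logically tied to the given input box."""
    question = (
        f"The page contains these labeled elements: {', '.join(element_labels)}. "
        f"Which label is logically associated with the input box '{input_box}' "
        "(for example its submit button or its describing text)? Answer with the label only."
    )
    answer = vlm(screenshot, question).strip()
    return answer if answer in element_labels else element_labels[0]  # conservative fallback
```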
arXiv Detail & Related papers (2024-10-16T01:37:58Z) - Steward: Natural Language Web Automation [19.301371856154965]
Large language models (LLMs) have demonstrated exceptional capabilities in serving as the foundation for AI assistants.
We introduce Steward, a novel LLM-powered web automation tool designed to serve as a cost-effective, scalable, end-to-end solution for automating web interactions.
We discuss various design and implementation challenges, including state representation, action sequence selection, system responsiveness, detecting task completion, and caching implementation.
arXiv Detail & Related papers (2024-09-23T18:06:32Z) - NaviQAte: Functionality-Guided Web Application Navigation [6.0759036120654315]
NaviQAte frames web application exploration as a question-and-answer task, generating action sequences for functionalities without requiring detailed parameters.
Our three-phase approach utilizes advanced large language models like GPT-4o for complex decision-making and cost-effective models, such as GPT-4o mini, for simpler tasks.
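One possible reading of this two-tier model usage is a simple router that sends complex decisions to the stronger model and simpler tasks to the cheaper one; the heuristic below is purely an illustrative assumption, not NaviQAte's method.

```python
# Sketch of routing between a stronger model for complex decision-making and a
# cheaper model for simpler tasks. The complexity heuristic and call interface
# are assumptions for illustration.
from typing import Callable

LLM = Callable[[str], str]

def is_complex(task: str) -> bool:
    """Crude placeholder heuristic: long or multi-step tasks go to the larger model."""
    return len(task.split()) > 30 or " then " in task.lower()

def answer(task: str, strong_model: LLM, cheap_model: LLM) -> str:
    """Route complex decisions to the strong model, simple ones to the cheap model."""
    model = strong_model if is_complex(task) else cheap_model
    return model(task)
```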
arXiv Detail & Related papers (2024-09-16T21:18:39Z) - DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z) - ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning [74.58666091522198]
We present a framework for intuitive robot programming by non-experts.
We leverage natural language prompts and contextual information from the Robot Operating System (ROS).
Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface.
arXiv Detail & Related papers (2024-06-28T08:28:38Z) - On the Multi-turn Instruction Following for Conversational Web Agents [83.51251174629084]
We introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment.
We propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques.
arXiv Detail & Related papers (2024-02-23T02:18:12Z) - Large Language User Interfaces: Voice Interactive User Interfaces powered by LLMs [5.06113628525842]
We present a framework that can serve as an intermediary between a user and their user interface (UI).
We employ a system that stands upon textual semantic mappings of UI components, in the form of annotations.
Our engine can classify the most appropriate application, extract relevant parameters, and subsequently execute precise predictions of the user's expected actions.
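The classify-then-extract pipeline could look roughly like the toy sketch below, where textual annotations of UI components are matched against the utterance to pick an application and fill its parameters; the matching logic and names are assumptions for illustration only.

```python
# Toy sketch of classify-then-extract over annotated UI components: pick the
# application whose annotation best matches the utterance, then fill its
# parameters. Purely illustrative; names and logic are assumed.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AppAnnotation:
    name: str
    keywords: List[str]                       # textual semantic annotation of the component
    parameters: List[str] = field(default_factory=list)

def classify_app(utterance: str, apps: List[AppAnnotation]) -> AppAnnotation:
    """Pick the application whose annotation overlaps most with the utterance."""
    words = set(utterance.lower().split())
    return max(apps, key=lambda a: len(words & set(a.keywords)))

def extract_parameters(utterance: str, app: AppAnnotation) -> Dict[str, str]:
    """Naive slot filling: take the words after each parameter name as its value."""
    tokens = utterance.lower().split()
    values: Dict[str, str] = {}
    for param in app.parameters:
        if param in tokens:
            idx = tokens.index(param)
            values[param] = " ".join(tokens[idx + 1:idx + 3])
    return values

apps = [AppAnnotation("email", ["send", "email", "mail"], ["to", "subject"]),
        AppAnnotation("calendar", ["schedule", "meeting", "event"], ["at", "with"])]
cmd = "send an email to alice smith subject weekly report"
app = classify_app(cmd, apps)
print(app.name, extract_parameters(cmd, app))
```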
arXiv Detail & Related papers (2024-02-07T21:08:49Z) - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z) - AllTogether: Investigating the Efficacy of Spliced Prompt for Web Navigation using Large Language Models [2.234037966956278]
We introduce AllTogether, a standardized prompt template that enhances task context representation.
We evaluate the efficacy of this approach through prompt learning and instruction fine-tuning based on open-source Llama-2 and API-accessible GPT models.
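A spliced prompt template of this kind might resemble the following sketch, which joins the task instruction, action history, and a simplified page representation into a single prompt; the field layout is an assumption, not AllTogether's actual template.

```python
# Illustrative sketch of a standardized, "spliced" prompt template that joins
# task instruction, interaction history, and a simplified page representation.
# The layout and action vocabulary are assumptions for illustration.
from typing import List

TEMPLATE = """You are a web navigation agent.
Task: {task}
Previous actions:
{history}
Current page (simplified HTML):
{page}
Choose the next action as: CLICK <id> | TYPE <id> <text> | STOP."""

def build_prompt(task: str, history: List[str], page_html: str) -> str:
    joined = "\n".join(f"- {h}" for h in history) or "- (none)"
    return TEMPLATE.format(task=task, history=joined, page=page_html)

print(build_prompt("Find the cheapest flight to Berlin",
                   ["TYPE 12 'Berlin'", "CLICK 7"],
                   "<button id=7>Search</button> <input id=12>"))
```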
arXiv Detail & Related papers (2023-10-20T11:10:14Z) - Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation [17.279875204729553]
Zero-Shot Object Navigation (ZSON) enables agents to navigate towards open-vocabulary objects in unknown environments.
We introduce ZIPON, where robots need to navigate to personalized goal objects while engaging in conversations with users.
We propose Open-woRld Interactive persOnalized Navigation (ORION) to make sequential decisions to manipulate different modules for perception, navigation and communication.
arXiv Detail & Related papers (2023-10-12T01:17:56Z) - ETPNav: Evolving Topological Planning for Vision-Language Navigation in
Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z) - AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event via navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.