WebNav: An Intelligent Agent for Voice-Controlled Web Navigation
- URL: http://arxiv.org/abs/2503.13843v1
- Date: Tue, 18 Mar 2025 02:33:27 GMT
- Title: WebNav: An Intelligent Agent for Voice-Controlled Web Navigation
- Authors: Trisanth Srinivasan, Santosh Patapati
- Abstract summary: WebNav is a voice-controlled web navigation agent that leverages a ReAct-inspired architecture and generative AI to address these challenges. Preliminary evaluations show that WebNav outperforms traditional screen readers in response time and task completion accuracy for visually impaired users.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing reliance on web interfaces presents many challenges for visually impaired users, showcasing the need for more advanced assistive technologies. This paper introduces WebNav, a voice-controlled web navigation agent that leverages a ReAct-inspired architecture and generative AI to address these challenges. WebNav comprises a hierarchical structure: a Digital Navigation Module (DIGNAV) for high-level strategic planning, an Assistant Module for translating abstract commands into executable actions, and an Inference Module for low-level interaction. A key component is a dynamic labeling engine, implemented as a browser extension, that generates real-time labels for interactive elements, creating a mapping between voice commands and Document Object Model (DOM) components. Preliminary evaluations show that WebNav outperforms traditional screen readers in response time and task completion accuracy for visually impaired users. Future work will focus on extensive user evaluations, benchmark development, and refining the agent's adaptive capabilities for real-world deployment.
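The paper does not publish an implementation, but the dynamic labeling engine is concrete enough to sketch. The content script below is a minimal, hypothetical version: it tags interactive DOM elements with short numeric labels, keeps a map from spoken labels to elements, and re-runs on DOM mutations. All identifiers (labelInteractiveElements, data-webnav-label, and so on) are illustrative assumptions, not taken from WebNav.

```typescript
// Hypothetical sketch of a dynamic labeling engine as a browser-extension
// content script. All identifiers are illustrative assumptions.

type VoiceTargetMap = Map<string, HTMLElement>;

const INTERACTIVE =
  "a[href], button, input, select, textarea, [role='button'], [role='link']";

function labelInteractiveElements(doc: Document): VoiceTargetMap {
  // Remove badges from a previous pass so labels never accumulate.
  doc.querySelectorAll(".webnav-badge").forEach((b) => b.remove());

  const targets: VoiceTargetMap = new Map();
  doc.querySelectorAll<HTMLElement>(INTERACTIVE).forEach((el, i) => {
    const label = String(i + 1); // short spoken handle, e.g. "click 12"
    el.setAttribute("data-webnav-label", label);

    const badge = doc.createElement("span"); // visible numeric overlay
    badge.className = "webnav-badge";
    badge.textContent = label;
    el.insertAdjacentElement("beforebegin", badge);

    targets.set(label, el);
  });
  return targets;
}

// Keep the voice-command -> DOM mapping in sync with dynamic pages.
let targets = labelInteractiveElements(document);
const observer = new MutationObserver(() => {
  observer.disconnect(); // avoid reacting to our own badge edits
  targets = labelInteractiveElements(document);
  observer.observe(document.body, { childList: true, subtree: true });
});
observer.observe(document.body, { childList: true, subtree: true });

// A recognized command such as "click 12" then resolves directly:
function executeClick(spokenLabel: string): void {
  targets.get(spokenLabel)?.click();
}
```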
Related papers
- R2D2: Remembering, Reflecting and Dynamic Decision Making for Web Agents [53.94879482534949]
Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents.
arXiv Detail & Related papers (2025-01-21T20:21:58Z)
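R2D2's actual implementation is not reproduced here; the sketch below only illustrates, under loose assumptions, what a remember-and-reflect loop for a web agent could look like: an episodic memory of visited states plus a reflection step that records why a failed action failed. All types and names are hypothetical.

```typescript
// Loose, hypothetical sketch of a remember-and-reflect loop in the spirit
// of R2D2. Nothing here is taken from the paper's actual implementation.

interface Episode {
  url: string;
  action: string;
  outcome: "success" | "failure";
  reflection?: string; // lesson distilled from a failure
}

class EpisodicMemory {
  private episodes: Episode[] = [];

  remember(e: Episode): void {
    this.episodes.push(e);
  }

  // Reflect: turn a failure into a reusable lesson for future planning.
  reflect(e: Episode, reason: string): void {
    e.reflection = reason;
  }

  // Retrieve lessons relevant to the page the agent is currently on.
  lessonsFor(url: string): string[] {
    return this.episodes
      .filter((e) => e.url === url && e.reflection)
      .map((e) => e.reflection as string);
  }
}

// Usage: feed past lessons back into the next planning prompt.
const memory = new EpisodicMemory();
const ep: Episode = { url: "https://example.com", action: "click #buy", outcome: "failure" };
memory.remember(ep);
memory.reflect(ep, "Button is disabled until a size is selected.");
console.log(memory.lessonsFor("https://example.com"));
```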
- PAFFA: Premeditated Actions For Fast Agents [23.363582411971567]
PAFFA is a framework designed to enhance web interaction capabilities through an Action API Library of reusable, verified browser interaction functions. It reduces inference calls by 87% while maintaining robust performance even as website structures evolve. This framework accelerates multi-page task execution and offers a scalable solution to advance autonomous web agent research.
arXiv Detail & Related papers (2024-12-10T22:51:31Z)
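PAFFA's Action API Library is described only at a high level; as a rough illustration of the pattern (reusable, pre-verified interaction functions that replace per-step LLM calls), consider the hypothetical sketch below. The function names and dispatch scheme are assumptions, not the paper's API.

```typescript
// Hypothetical sketch of an action-library pattern like PAFFA's: verified
// browser interaction functions are stored once and replayed without
// calling the LLM again. Identifiers are illustrative, not from the paper.

type Action = (page: Document, arg: string) => void;

const actionLibrary = new Map<string, Action>();

// Register a function after it has been verified to work on the target site.
function register(name: string, fn: Action): void {
  actionLibrary.set(name, fn);
}

register("searchProducts", (page, query) => {
  const box = page.querySelector<HTMLInputElement>("input[name='q']");
  if (!box) throw new Error("search box not found; site layout changed");
  box.value = query;
  box.form?.submit();
});

// Dispatch: a stored action is a direct function call, so no LLM inference
// is needed for steps the library already covers.
function run(name: string, page: Document, arg: string): void {
  const fn = actionLibrary.get(name);
  if (fn) fn(page, arg); // fast path: replay the verified function
  else {
    /* fall back to LLM planning for unseen actions */
  }
}

run("searchProducts", document, "noise-cancelling headphones");
```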
- From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents [7.41862656697588]
This study aims to analyze the various contextual elements crucial to the functioning of web navigation agents.
We focus on the influence of interaction history and web page representation.
Our work highlights improved agent performance across out-of-distribution scenarios.
arXiv Detail & Related papers (2024-10-31T01:51:41Z)
- Infogent: An Agent-Based Framework for Web Information Aggregation [59.67710556177564]
We introduce Infogent, a novel framework for web information aggregation.
Experiments on different information access settings demonstrate that Infogent outperforms an existing state-of-the-art multi-agent search framework by 7%.
arXiv Detail & Related papers (2024-10-24T18:01:28Z)
- NaviQAte: Functionality-Guided Web Application Navigation [6.0759036120654315]
NaviQAte frames web application exploration as a question-and-answer task, generating action sequences for functionalities without requiring detailed parameters.
Our three-phase approach utilizes advanced large language models like GPT-4o for complex decision-making and cost-effective models, such as GPT-4o mini, for simpler tasks.
arXiv Detail & Related papers (2024-09-16T21:18:39Z)
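The summary above names a concrete cost/capability split (GPT-4o for complex decisions, GPT-4o mini for simple ones). A hypothetical router implementing such a split might look like the sketch below; the difficulty heuristic and all function names are invented for illustration, not taken from NaviQAte.

```typescript
// Hypothetical model-routing sketch in the spirit of NaviQAte's split:
// an expensive model for complex decisions, a cheap one for simple tasks.
// The heuristic and the callLLM stub are assumptions, not the paper's code.

type Model = "gpt-4o" | "gpt-4o-mini";

interface Step {
  description: string;
  candidateActions: number; // how many plausible next actions exist
}

function pickModel(step: Step): Model {
  // Toy heuristic: ambiguous steps (many candidates, long descriptions)
  // go to the stronger model; trivial steps go to the cheap one.
  const complex = step.candidateActions > 3 || step.description.length > 200;
  return complex ? "gpt-4o" : "gpt-4o-mini";
}

// Stub standing in for a real chat-completion call.
async function callLLM(model: Model, prompt: string): Promise<string> {
  console.log(`[${model}] ${prompt}`);
  return "next_action";
}

async function decide(step: Step): Promise<string> {
  return callLLM(pickModel(step), `Choose the next action: ${step.description}`);
}

decide({ description: "Accept the cookie banner", candidateActions: 1 });
```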
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled at diverse tasks specified by human commands is a long-term goal of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
- Large Language User Interfaces: Voice Interactive User Interfaces powered by LLMs [5.06113628525842]
We present a framework that can serve as an intermediary between a user and their user interface (UI).
We employ a system built upon textual semantic mappings of UI components, in the form of annotations.
Our engine can classify the most appropriate application, extract relevant parameters, and subsequently execute precise predictions of the user's expected actions.
arXiv Detail & Related papers (2024-02-07T21:08:49Z)
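As a rough, hypothetical reading of the annotation idea: each UI component carries a textual semantic description, and the engine matches a transcribed utterance against those descriptions to pick a target. The keyword-overlap scorer below is a naive stand-in for whatever model the paper actually uses; all names are invented.

```typescript
// Hypothetical sketch of annotation-based intent matching: UI components
// carry textual semantic descriptions, and a spoken command is scored
// against them. The keyword-overlap scorer is a stand-in, not the paper's
// actual engine.

interface Annotation {
  componentId: string;
  description: string; // e.g. "text field for the recipient email address"
}

const annotations: Annotation[] = [
  { componentId: "to-field", description: "text field for the recipient email address" },
  { componentId: "send-btn", description: "button that sends the composed email" },
];

function score(utterance: string, ann: Annotation): number {
  const words = new Set(utterance.toLowerCase().split(/\s+/));
  return ann.description
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => words.has(w)).length;
}

function matchComponent(utterance: string): Annotation | undefined {
  return annotations
    .slice()
    .sort((a, b) => score(utterance, b) - score(utterance, a))[0];
}

// "send the composed email" overlaps most with the send button's annotation.
console.log(matchComponent("send the composed email")?.componentId); // "send-btn"
```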
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- AllTogether: Investigating the Efficacy of Spliced Prompt for Web Navigation using Large Language Models [2.234037966956278]
We introduce AllTogether, a standardized prompt template that enhances task context representation.
We evaluate the efficacy of this approach through prompt learning and instruction finetuning based on open-source Llama-2 and API-accessible GPT models.
arXiv Detail & Related papers (2023-10-20T11:10:14Z)
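A standardized prompt template for web navigation is straightforward to illustrate; the sketch below is one plausible shape (task, simplified HTML, and action history spliced into fixed slots), with the slot names and wording invented for the example rather than taken from AllTogether.

```typescript
// Hypothetical sketch of a spliced prompt template in the spirit of
// AllTogether: fixed slots for task, page context, and history. The slot
// names and wording are invented for illustration.

interface NavContext {
  task: string;           // the user's goal
  simplifiedHtml: string; // pruned DOM of the current page
  history: string[];      // actions taken so far
}

function buildPrompt(ctx: NavContext): string {
  return [
    "You are a web navigation agent.",
    `Task: ${ctx.task}`,
    `Current page (simplified HTML):\n${ctx.simplifiedHtml}`,
    `Previous actions:\n${ctx.history.map((h, i) => `${i + 1}. ${h}`).join("\n")}`,
    "Respond with exactly one next action.",
  ].join("\n\n");
}

console.log(
  buildPrompt({
    task: "Find the cheapest direct flight to Oslo",
    simplifiedHtml: "<form><input id='dest'/><button id='search'>Search</button></form>",
    history: ["type dest 'Oslo'"],
  })
);
```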
- Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation [17.279875204729553]
Zero-Shot Object Navigation (ZSON) enables agents to navigate towards open-vocabulary objects in unknown environments.
We introduce ZIPON, where robots need to navigate to personalized goal objects while engaging in conversations with users.
We propose Open-woRld Interactive persOnalized Navigation (ORION) to make sequential decisions to manipulate different modules for perception, navigation and communication.
arXiv Detail & Related papers (2023-10-12T01:17:56Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event by navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.