YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks
- URL: http://arxiv.org/abs/2501.09355v1
- Date: Thu, 16 Jan 2025 08:06:02 GMT
- Title: YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks
- Authors: Saptarashmi Bandyopadhyay, Vikas Bahirwani, Lavisha Aggarwal, Bhanu Guda, Lin Li, Andrea Colaco
- Abstract summary: Augmented Reality (AR) head-worn devices can uniquely improve the user experience of solving procedural day-to-day tasks.
Such AR capabilities let AI Agents see and listen to the actions users take, paralleling the multimodal perception of human users.
Proactive AI Agents, in turn, can help the human user detect and correct mistakes in agent-observed tasks.
- Score: 16.443149180969776
- Abstract: Multimodal AI Agents are AI models that can interactively and cooperatively assist human users in solving day-to-day tasks. Augmented Reality (AR) head-worn devices can uniquely improve the user experience of solving procedural day-to-day tasks by providing egocentric multimodal (audio and video) observational capabilities to AI Agents. Such AR capabilities help AI Agents see and listen to the actions users take, paralleling the multimodal perception of human users. Existing AI Agents, whether Large Language Models (LLMs) or Multimodal Vision-Language Models (VLMs), are reactive in nature: they cannot take an action without reading or listening to the human user's prompts. Proactivity, on the other hand, allows an AI Agent to help the human user detect and correct mistakes in agent-observed tasks, encourage users when they do tasks correctly, or simply engage in conversation with the user - akin to a human teaching or assisting a user. Our proposed YET to Intervene (YETI) multimodal agent focuses on the research question of identifying circumstances in which the agent should intervene proactively, so that it understands when to step into a conversation with the human user and help correct mistakes on tasks, like cooking, using AR. Our YETI Agent learns scene-understanding signals based on interpretable notions of Structural Similarity (SSIM) between consecutive video frames. We also define an alignment signal with which the AI Agent learns to identify whether the video frames corresponding to the user's actions on the task are consistent with the expected actions. Our AI Agent uses these signals to determine when it should proactively intervene. We compare our results on the instances of proactive intervention in the HoloAssist multimodal benchmark, in which an expert agent guides a user through procedural tasks.
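As a rough illustration of how the two signals described in the abstract could be combined, the sketch below computes an SSIM-based scene-change signal on consecutive frames and a simple alignment check, then folds them into an intervention decision. This is a minimal sketch, not the paper's implementation: the grayscale 8-bit frame assumption, the cosine-similarity alignment proxy, and the threshold values are all illustrative assumptions.

```python
# Minimal, illustrative sketch of SSIM-based scene-change and alignment
# signals for proactive intervention. Assumptions not taken from the paper:
# frames are 8-bit grayscale numpy arrays, thresholds are placeholders, and
# alignment is approximated by cosine similarity between hypothetical
# embeddings of the observed frame and of the expected task step.
import numpy as np
from skimage.metrics import structural_similarity  # pip install scikit-image


def scene_change_signal(prev_frame: np.ndarray, frame: np.ndarray) -> float:
    """SSIM between consecutive frames; a low value suggests the scene
    (and hence the user's activity) is changing."""
    return structural_similarity(prev_frame, frame, data_range=255)


def alignment_signal(frame_emb: np.ndarray, step_emb: np.ndarray) -> float:
    """Cosine similarity between an embedding of the observed frame and an
    embedding of the expected task step (embeddings assumed given)."""
    denom = np.linalg.norm(frame_emb) * np.linalg.norm(step_emb) + 1e-8
    return float(np.dot(frame_emb, step_emb) / denom)


def should_intervene(prev_frame, frame, frame_emb, step_emb,
                     ssim_thresh=0.7, align_thresh=0.5) -> bool:
    """Intervene proactively when the scene is changing (an action is
    underway) but the observed action does not match the expected step."""
    acting = scene_change_signal(prev_frame, frame) < ssim_thresh
    misaligned = alignment_signal(frame_emb, step_emb) < align_thresh
    return acting and misaligned
```

In a real pipeline, the frames would come from the AR device's egocentric video stream and the embeddings from a pretrained vision-language model; the thresholds here stand in for whatever decision rule the agent actually learns.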
Related papers
- ChatCollab: Exploring Collaboration Between Humans and AI Agents in Software Teams [1.3967206132709542]
ChatCollab's novel architecture allows agents - human or AI - to join collaborations in any role.
Using software engineering as a case study, we find that our AI agents successfully identify their roles and responsibilities.
In relation to three prior multi-agent AI systems for software development, we find ChatCollab AI agents produce comparable or better software in an interactive game development task.
arXiv Detail & Related papers (2024-12-02T21:56:46Z) - Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input [54.81155589931697]
We propose a new task, Collaborative Instance Navigation (CoIN), with dynamic agent-human interaction during navigation.
To address CoIN, we propose a novel method, Agent-user Interaction with UncerTainty Awareness (AIUTA).
AIUTA achieves competitive performance in instance navigation against state-of-the-art methods, demonstrating great flexibility in handling user inputs.
arXiv Detail & Related papers (2024-12-02T08:16:38Z) - CACA Agent: Capability Collaboration based AI Agent [18.84686313298908]
We propose CACA Agent (Capability Collaboration based AI Agent) using an open architecture inspired by service computing.
CACA Agent integrates a set of collaborative capabilities to implement AI Agents, reducing the dependence on any single LLM.
We present a demo to illustrate the operation and the application scenario extension of CACA Agent.
arXiv Detail & Related papers (2024-03-22T11:42:47Z) - Tell Me More! Towards Implicit User Intention Understanding of Language
Model Driven Agents [110.25679611755962]
Current language model-driven agents often lack mechanisms for effective user participation, which is crucial given the vagueness commonly found in user instructions.
We introduce Intention-in-Interaction (IN3), a novel benchmark designed to inspect users' implicit intentions through explicit queries.
We empirically train Mistral-Interact, a powerful model that proactively assesses task vagueness, inquires user intentions, and refines them into actionable goals.
arXiv Detail & Related papers (2024-02-14T14:36:30Z) - One Agent Too Many: User Perspectives on Approaches to Multi-agent
Conversational AI [10.825570464035872]
We show that users have a significant preference for abstracting agent orchestration, in terms of both system usability and system performance.
We demonstrate that this mode of interaction is able to provide quality responses that are rated within 1% of human-selected answers.
arXiv Detail & Related papers (2024-01-13T17:30:57Z) - Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z) - AgentCF: Collaborative Learning with Autonomous Language Agents for
Recommender Systems [112.76941157194544]
We propose AgentCF for simulating user-item interactions in recommender systems through agent-based collaborative filtering.
We creatively consider not only users but also items as agents, and develop a collaborative learning approach that optimizes both kinds of agents together.
Overall, the optimized agents exhibit diverse interaction behaviors within our framework, including user-item, user-user, item-item, and collective interactions.
arXiv Detail & Related papers (2023-10-13T16:37:14Z) - ProAgent: Building Proactive Cooperative Agents with Large Language
Models [89.53040828210945]
ProAgent is a novel framework that harnesses large language models to create proactive agents.
ProAgent can analyze the present state and infer the intentions of teammates from observations.
ProAgent exhibits a high degree of modularity and interpretability, making it easily integrated into various coordination scenarios.
arXiv Detail & Related papers (2023-08-22T10:36:56Z) - Improving Grounded Language Understanding in a Collaborative Environment
by Interacting with Agents Through Help Feedback [42.19685958922537]
We argue that human-AI collaboration should be interactive, with humans monitoring the work of AI agents and providing feedback that the agent can understand and utilize.
In this work, we explore these directions using the challenging task defined by the IGLU competition, an interactive grounded language understanding task in a Minecraft-like world.
arXiv Detail & Related papers (2023-04-21T05:37:59Z) - Conveying Autonomous Robot Capabilities through Contrasting Behaviour
Summaries [8.413049356622201]
We present an adaptive search method for efficiently generating contrasting behaviour summaries.
Our results indicate that adaptive search can efficiently identify informative contrasting scenarios that enable humans to accurately select the better performing agent.
arXiv Detail & Related papers (2023-04-01T18:20:59Z) - Watch-And-Help: A Challenge for Social Perception and Human-AI
Collaboration [116.28433607265573]
We introduce Watch-And-Help (WAH), a challenge for testing social intelligence in AI agents.
In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently.
We build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning and learning based baselines.
arXiv Detail & Related papers (2020-10-19T21:48:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.