MUG: Interactive Multimodal Grounding on User Interfaces
- URL: http://arxiv.org/abs/2209.15099v1
- Date: Thu, 29 Sep 2022 21:08:18 GMT
- Title: MUG: Interactive Multimodal Grounding on User Interfaces
- Authors: Tao Li, Gang Li, Jingjie Zheng, Purple Wang, Yang Li
- Abstract summary: We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen.
Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to it. MUG allows multiple rounds of interaction so that, upon seeing the agent's responses, the user can give further commands for the agent to refine or even correct its actions.
- Score: 12.035123646959669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MUG, a novel interactive task for multimodal grounding where a
user and an agent work collaboratively on an interface screen. Prior works
modeled multimodal UI grounding in one round: the user gives a command and the
agent responds to the command. Yet, in a realistic scenario, a user command can
be ambiguous when the target action is inherently difficult to articulate in
natural language. MUG allows multiple rounds of interaction such that, upon seeing the agent's responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performance in real-world use cases. To investigate the problem, we create a new dataset that consists of 77,820 sequences of human user-agent interaction on mobile interfaces, of which 20% involve multiple rounds of interaction. To establish our benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation; the online strategy combines human evaluation with automatic evaluation using simulators. Our experiments show that allowing iterative interaction significantly improves absolute task completion by 18% over the entire test dataset and by 31% over the challenging subset. Our results lay the foundation for further investigation of the problem.
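The multi-round protocol described above is easy to picture in code. Below is a minimal, illustrative sketch of such an interaction loop; `agent.ground`, `user.feedback`, and the `Turn` record are hypothetical interfaces assumed for this example and are not part of the MUG release.

```python
# Minimal illustrative sketch (not the authors' code) of a multi-round
# grounding loop: the agent grounds each command to a UI element, and the
# user either accepts the action or issues a follow-up command.
# `agent.ground` and `user.feedback` are hypothetical interfaces.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    command: str           # natural-language instruction from the user
    predicted_target: int  # index of the UI element chosen by the agent


def run_episode(agent, user, screen, initial_command: str,
                max_rounds: int = 5) -> List[Turn]:
    """Run up to `max_rounds` of user-agent interaction on one screen."""
    history: List[Turn] = []
    command: Optional[str] = initial_command

    for _ in range(max_rounds):
        # The agent grounds the current command, conditioned on the screen
        # and the interaction history so far.
        target = agent.ground(screen, command, history)
        history.append(Turn(command=command, predicted_target=target))

        # The user (a human rater in human evaluation, or a simulator in
        # automatic online evaluation) inspects the action and either
        # accepts it (returns None) or issues a corrective command.
        command = user.feedback(screen, target, history)
        if command is None:
            break

    return history
```

In the paper's online evaluation, the `user` role is filled by either a human rater or a simulator; one way to approximate automatic evaluation in this sketch would be a simulator whose `feedback` returns a corrective command whenever the predicted element differs from the labeled target.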
Related papers
- Simulating User Agents for Embodied Conversational-AI [9.402740034754455]
We build a large language model (LLM)-based user agent that can simulate user behavior during interactions with an embodied agent.
We evaluate our user agent's ability to generate human-like behaviors by comparing its simulated dialogues with the TEACh dataset.
arXiv Detail & Related papers (2024-10-31T00:56:08Z) - A Survey on Complex Tasks for Goal-Directed Interactive Agents [60.53915548970061]
This survey compiles relevant tasks and environments for evaluating goal-directed interactive agents.
An up-to-date compilation of relevant resources can be found on our project website.
arXiv Detail & Related papers (2024-09-27T08:17:53Z) - PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games [18.383262467079078]
PLAYER* enhances path planning in Murder Mystery Games (MMGs) using an anytime sampling-based planner and a questioning-driven search framework.
By equipping agents with a set of sensors, PLAYER* eliminates the need for pre-defined questions and enables agents to navigate complex social interactions.
We additionally introduce a quantifiable evaluation method based on multiple-choice questions and present WellPlay, a dataset containing 1,482 question-answer pairs.
arXiv Detail & Related papers (2024-04-26T19:07:30Z) - AgentCF: Collaborative Learning with Autonomous Language Agents for
Recommender Systems [112.76941157194544]
We propose AgentCF for simulating user-item interactions in recommender systems through agent-based collaborative filtering.
We consider not only users but also items as agents, and develop a collaborative learning approach that optimizes both kinds of agents together.
Overall, the optimized agents exhibit diverse interaction behaviors within our framework, including user-item, user-user, item-item, and collective interactions.
arXiv Detail & Related papers (2023-10-13T16:37:14Z) - You Only Look at Screens: Multimodal Chain-of-Action Agents [37.118034745972956]
Auto-GUI is a multimodal solution that directly interacts with the interface.
We propose a chain-of-action technique to help the agent decide what action to execute.
We evaluate our approach on a new device-control benchmark, AITW, with 30K unique instructions.
arXiv Detail & Related papers (2023-09-20T16:12:32Z) - Tachikuma: Understading Complex Interactions with Multi-Character and
Novel Objects by Large Language Models [67.20964015591262]
We introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation task and a supporting dataset.
The dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations.
We present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding.
arXiv Detail & Related papers (2023-07-24T07:40:59Z) - First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual
Information Maximization [112.40598205054994]
We formalize this idea as a completely unsupervised objective for optimizing interfaces.
We conduct an observational study on 540K examples of users operating various keyboard and eye gaze interfaces for typing, controlling simulated robots, and playing video games.
The results show that our mutual information scores are predictive of the ground-truth task completion metrics in a variety of domains.
arXiv Detail & Related papers (2022-05-24T21:57:18Z) - Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward
Decomposition [64.06167416127386]
We propose Multi-Agent Dialog Policy Learning, which regards both the system and the user as the dialog agents.
Two agents interact with each other and are jointly learned simultaneously.
Results show that our method can successfully build a system policy and a user policy simultaneously.
arXiv Detail & Related papers (2020-04-08T04:51:40Z) - SPA: Verbal Interactions between Agents and Avatars in Shared Virtual
Environments using Propositional Planning [61.335252950832256]
Sense-Plan-Ask, or SPA, generates plausible verbal interactions between virtual human-like agents and user avatars in shared virtual environments.
We find that our algorithm creates a small runtime cost and enables agents to complete their goals more effectively than agents without the ability to leverage natural-language communication.
arXiv Detail & Related papers (2020-02-08T23:15:06Z)