MUG: Interactive Multimodal Grounding on User Interfaces
- URL: http://arxiv.org/abs/2209.15099v1
- Date: Thu, 29 Sep 2022 21:08:18 GMT
- Title: MUG: Interactive Multimodal Grounding on User Interfaces
- Authors: Tao Li, Gang Li, Jingjie Zheng, Purple Wang, Yang Li
- Abstract summary: We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen.
Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to the command. MUG allows multiple rounds of interactions such that upon seeing the agent responses, the user can give further commands for the agent to refine or even correct its actions.
- Score: 12.035123646959669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MUG, a novel interactive task for multimodal grounding where a
user and an agent work collaboratively on an interface screen. Prior works
modeled multimodal UI grounding in one round: the user gives a command and the
agent responds to the command. Yet, in a realistic scenario, a user command can
be ambiguous when the target action is inherently difficult to articulate in
natural language. MUG allows multiple rounds of interactions such that upon
seeing the agent responses, the user can give further commands for the agent to
refine or even correct its actions. Such interaction is critical for improving
grounding performances in real-world use cases. To investigate the problem, we
create a new dataset that consists of 77,820 sequences of human user-agent
interaction on mobile interfaces in which 20% involves multiple rounds of
interactions. To establish our benchmark, we experiment with a range of
modeling variants and evaluation strategies, including both offline and online
evaluation; the online strategy consists of both human evaluation and automatic
evaluation with simulators. Our experiments show that allowing iterative interaction
significantly improves the absolute task completion by 18% over the entire test
dataset and 31% over the challenging subset. Our results lay the foundation for
further investigation of the problem.
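The loop the abstract describes (the agent acts, the user reviews the result and issues a follow-up command, the agent retries) and the simulator-based online evaluation can be pictured with a small amount of code. The sketch below is purely illustrative and assumes a hypothetical setup: Screen, NaiveAgent, UserSimulator, and run_episode are placeholder names, the rule-based simulator merely stands in for the paper's human and automatic evaluators, and none of this reflects the authors' actual implementation or dataset format.

import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Screen:
    """A UI screen reduced to a list of candidate elements (hypothetical format)."""
    elements: List[str]      # e.g., labels of view-hierarchy nodes
    target_index: int        # ground-truth element the user wants (used only for evaluation)


class NaiveAgent:
    """Toy grounding agent: guesses element 0, then follows explicit hints in corrections."""
    def predict(self, screen: Screen, dialog: List[str]) -> int:
        for turn in reversed(dialog):                 # latest command first
            match = re.search(r"index (\d+)", turn)
            if match:
                return int(match.group(1))
        return 0


class UserSimulator:
    """Rule-based stand-in for the human: confirms or issues a corrective command."""
    def follow_up(self, screen: Screen, predicted: int) -> Optional[str]:
        if predicted == screen.target_index:
            return None                               # agent got it right, stop
        return f"Not that one; tap the element at index {screen.target_index}."


def run_episode(agent, user, screen: Screen, first_command: str, max_rounds: int = 3) -> bool:
    """Run up to max_rounds of interaction; return True if the target is selected."""
    dialog = [first_command]
    for _ in range(max_rounds):
        predicted = agent.predict(screen, dialog)
        correction = user.follow_up(screen, predicted)
        if correction is None:
            return True                               # task completed this round
        dialog.append(correction)                     # agent retries with the extra guidance
    return False


if __name__ == "__main__":
    screen = Screen(elements=["Back", "Search", "Settings"], target_index=2)
    done = run_episode(NaiveAgent(), UserSimulator(), screen, "Open the settings page")
    print("completed:", done)    # True after a single corrective round

Under this framing, the reported gains (18% absolute task completion overall, 31% on the challenging subset) correspond to episodes that fail in the first round but succeed once the simulated or real user supplies a corrective command.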
Related papers
- Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input [54.81155589931697]
We propose a new task, Collaborative Instance Navigation (CoIN), with dynamic agent-human interaction during navigation.
To address CoIN, we propose a novel method, Agent-user Interaction with UncerTainty Awareness (AIUTA).
AIUTA achieves competitive performance in instance navigation against state-of-the-art methods, demonstrating great flexibility in handling user inputs.
arXiv Detail & Related papers (2024-12-02T08:16:38Z)
- Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping [57.024913536420264]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on the design-to-code task.
We present the first systematic investigation of MLLMs in generating interactive webpages.
arXiv Detail & Related papers (2024-11-05T17:40:03Z)
- Simulating User Agents for Embodied Conversational-AI [9.402740034754455]
We build a large language model (LLM)-based user agent that can simulate user behavior during interactions with an embodied agent.
We evaluate our user agent's ability to generate human-like behaviors by comparing its simulated dialogues with the TEACh dataset.
arXiv Detail & Related papers (2024-10-31T00:56:08Z)
- AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems [112.76941157194544]
We propose AgentCF for simulating user-item interactions in recommender systems through agent-based collaborative filtering.
We consider not only users but also items as agents, and develop a collaborative learning approach that optimizes both kinds of agents together.
Overall, the optimized agents exhibit diverse interaction behaviors within our framework, including user-item, user-user, item-item, and collective interactions.
arXiv Detail & Related papers (2023-10-13T16:37:14Z)
- You Only Look at Screens: Multimodal Chain-of-Action Agents [37.118034745972956]
Auto-GUI is a multimodal solution that directly interacts with the interface.
We propose a chain-of-action technique to help the agent decide what action to execute.
We evaluate our approach on a new device-control benchmark, AITW, with 30K unique instructions.
arXiv Detail & Related papers (2023-09-20T16:12:32Z)
- Tachikuma: Understanding Complex Interactions with Multi-Character and Novel Objects by Large Language Models [67.20964015591262]
We introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation task and a supporting dataset.
The dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations.
We present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding.
arXiv Detail & Related papers (2023-07-24T07:40:59Z)
- First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization [112.40598205054994]
We formalize this idea as a completely unsupervised objective for optimizing interfaces.
We conduct an observational study on 540K examples of users operating various keyboard and eye gaze interfaces for typing, controlling simulated robots, and playing video games.
The results show that our mutual information scores are predictive of the ground-truth task completion metrics in a variety of domains.
arXiv Detail & Related papers (2022-05-24T21:57:18Z)
- Effects of Naturalistic Variation in Goal-Oriented Dialog [12.49850843793842]
We investigate the impact of naturalistic variation on two goal-oriented datasets: bAbI dialog task and Stanford Multi-Domain dataset.
We propose new and more effective testbeds for both datasets, by introducing naturalistic variation by the user.
arXiv Detail & Related papers (2020-10-05T18:13:45Z)
- Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward Decomposition [64.06167416127386]
We propose Multi-Agent Dialog Policy Learning, which regards both the system and the user as the dialog agents.
Two agents interact with each other and are jointly learned simultaneously.
Results show that our method can successfully build a system policy and a user policy simultaneously.
arXiv Detail & Related papers (2020-04-08T04:51:40Z)