Enabling Conversational Interaction with Mobile UI using Large Language Models
- URL: http://arxiv.org/abs/2209.08655v1
- Date: Sun, 18 Sep 2022 20:58:39 GMT
- Title: Enabling Conversational Interaction with Mobile UI using Large Language Models
- Authors: Bryan Wang, Gang Li, Yang Li
- Abstract summary: To perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task.
This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single language model.
- Score: 15.907868408556885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational agents show promise for allowing users to interact with mobile
devices using language. However, to perform diverse UI tasks with natural
language, developers typically need to create separate datasets and models for
each specific task, which is costly and labor-intensive. Recently,
pre-trained large language models (LLMs) have been shown capable of
generalizing to various downstream tasks when prompted with a handful of
examples from the target task. This paper investigates the feasibility of
enabling versatile conversational interactions with mobile UIs using a single
LLM. We propose a design space to categorize conversations between the user and
the agent when collaboratively accomplishing mobile tasks. We design prompting
techniques to adapt an LLM to conversational tasks on mobile UIs. The
experiments show that our approach enables various conversational interactions
with decent performance, demonstrating its feasibility. We discuss the use cases
of our work and its implications for language-based mobile interaction.
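To make the prompting approach described in the abstract more concrete, the sketch below is an illustrative example rather than the paper's implementation: it assumes a hypothetical HTML-like text serialization of a screen's elements and a generic `complete` callable standing in for an LLM API, and shows how a handful of worked examples plus a new screen can be assembled into a single few-shot prompt for a screen-summarization task.

```python
# Illustrative sketch (not the paper's code): few-shot prompting an LLM to
# summarize a mobile screen. The serialization format, the example screens,
# and the `complete` helper are assumptions made for this sketch.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UIElement:
    role: str   # e.g. "button", "text", "input"
    text: str   # visible label or content


def serialize_screen(elements: List[UIElement]) -> str:
    """Flatten a screen's elements into an HTML-like string the LLM can read."""
    return " ".join(f"<{e.role}>{e.text}</{e.role}>" for e in elements)


FEW_SHOT_EXAMPLES = [
    # (serialized screen, target summary) pairs; contents are made up here.
    ("<text>Inbox</text> <button>Compose</button> <text>3 unread</text>",
     "An email inbox screen with 3 unread messages and a compose button."),
    ("<input>Search settings</input> <button>Wi-Fi</button> <button>Bluetooth</button>",
     "A settings screen with a search bar and network toggles."),
]


def build_prompt(screen: str) -> str:
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    parts = ["Summarize the mobile screen in one sentence."]
    for ui, summary in FEW_SHOT_EXAMPLES:
        parts.append(f"Screen: {ui}\nSummary: {summary}")
    parts.append(f"Screen: {screen}\nSummary:")
    return "\n\n".join(parts)


def summarize(elements: List[UIElement], complete: Callable[[str], str]) -> str:
    """`complete` is any text-completion function, e.g. a call to an LLM API."""
    return complete(build_prompt(serialize_screen(elements))).strip()
```

Other conversational tasks could be handled with the same template by swapping the instruction and the worked examples while keeping the screen serialization fixed.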
Related papers
- SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction [12.948174983519785]
We present SAUCE, a customizable Python platform for group simulations.
Our platform takes care of instantiating the models, scheduling their responses, managing the discussion history, and producing a comprehensive output log.
A novel feature of SAUCE is asynchronous communication, where models decide when to speak in addition to what to say.
arXiv Detail & Related papers (2024-11-05T18:31:06Z)
- Training a Vision Language Model as Smartphone Assistant [1.3654846342364308]
We present a visual language model (VLM) that can fulfill diverse tasks on mobile devices.
Our model functions by interacting solely with the user interface (UI).
Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots.
arXiv Detail & Related papers (2024-04-12T18:28:44Z)
- MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
The MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z)
- Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations [70.7884839812069]
Large language models (LLMs) have emerged as powerful and general solutions to many natural language tasks.
However, many of the most important applications of language generation are interactive, where an agent has to talk to a person to reach a desired outcome.
In this work, we explore a new method for adapting LLMs with RL for such goal-directed dialogue.
arXiv Detail & Related papers (2023-11-09T18:45:16Z)
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing [99.80742991922992]
The system can have multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses.
LLaVA-Interactive goes beyond language prompts: visual prompts are also enabled to align human intents in the interaction.
arXiv Detail & Related papers (2023-11-01T15:13:43Z)
- ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations [13.939350184164017]
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language.
We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM).
We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z)
- UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions [64.50935101415776]
We build a single model that jointly performs various spoken language understanding (SLU) tasks.
We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages.
arXiv Detail & Related papers (2023-10-04T17:10:23Z)
- Unified Human-Scene Interaction via Prompted Chain-of-Contacts [61.87652569413429]
Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality.
This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands.
arXiv Detail & Related papers (2023-09-14T17:59:49Z)
- ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning [24.87615615489849]
We present precise referring instructions that utilize diverse reference representations, such as points and boxes, as referring prompts to indicate specific regions.
We propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes.
arXiv Detail & Related papers (2023-07-18T17:56:06Z)
- Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning [34.24671403624908]
Mobile User Interface Summarization generates succinct language descriptions of mobile screens to convey a screen's important content and functionality.
We present Screen2Words, a novel screen summarization approach that automatically encapsulates essential information of a UI screen into a coherent language phrase.
arXiv Detail & Related papers (2021-08-07T03:01:23Z)
- Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments [54.405920619915655]
We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a dataset with natural language commands for the greatest number of interactive environments to date.
MoTIF is the first to contain natural language requests for interactive environments that are not satisfiable.
We perform initial feasibility classification experiments and only reach an F1 score of 37.3, verifying the need for richer vision-language representations.
arXiv Detail & Related papers (2021-04-17T14:48:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.