Mapping Natural Language Instructions to Mobile UI Action Sequences
- URL: http://arxiv.org/abs/2005.03776v2
- Date: Fri, 5 Jun 2020 02:11:56 GMT
- Title: Mapping Natural Language Instructions to Mobile UI Action Sequences
- Authors: Yang Li and Jiacong He and Xin Zhou and Yuan Zhang and Jason Baldridge
- Abstract summary: We present a new problem: grounding natural language instructions to mobile user interface actions.
We create PIXELHELP, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator.
To scale training, we decouple the language and action data by (a) annotating action phrase spans in HowTo instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces.
- Score: 17.393816815196974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new problem: grounding natural language instructions to mobile
user interface actions, and create three new datasets for it. For full task
evaluation, we create PIXELHELP, a corpus that pairs English instructions with
actions performed by people on a mobile UI emulator. To scale training, we
decouple the language and action data by (a) annotating action phrase spans in
HowTo instructions and (b) synthesizing grounded descriptions of actions for
mobile user interfaces. We use a Transformer to extract action phrase tuples
from long-range natural language instructions. A grounding Transformer then
contextually represents UI objects using both their content and screen position
and connects them to object descriptions. Given a starting screen and
instruction, our model achieves 70.59% accuracy on predicting complete
ground-truth action sequences in PIXELHELP.
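The abstract describes a two-stage model: a Transformer that extracts action phrase tuples (e.g., an operation and an object description) from the instruction, and a grounding Transformer that encodes UI objects from their content and screen position and matches them to those descriptions. The sketch below is a minimal illustration of that pipeline, not the authors' implementation; the tag inventory, embedding sizes, layer counts, and the mean-pooled phrase vector are assumptions made for the example.

```python
# A minimal sketch (not the paper's code) of the two-stage pipeline described above:
# an "action phrase" Transformer encodes the instruction and predicts span tags,
# and a grounding Transformer encodes UI objects from content plus screen position,
# then scores them against a phrase representation by dot product.
# All dimensions, vocab sizes, and layer counts are illustrative.
import torch
import torch.nn as nn


class PhraseTupleExtractor(nn.Module):
    """Encodes instruction tokens; a tagging head marks spans for the
    (operation, object-description, argument) tuple. Tag set is assumed."""

    def __init__(self, vocab_size=10000, d_model=128, n_tags=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.tagger = nn.Linear(d_model, n_tags)  # BIO-style span tags (assumed)

    def forward(self, token_ids):                 # (batch, seq_len)
        h = self.encoder(self.embed(token_ids))   # (batch, seq_len, d_model)
        return self.tagger(h), h                  # tag logits + contextual states


class ScreenGrounder(nn.Module):
    """Contextually encodes UI objects from content embeddings plus normalized
    screen coordinates, then scores each object against a phrase vector."""

    def __init__(self, d_content=64, d_model=128):
        super().__init__()
        self.project = nn.Linear(d_content + 4, d_model)  # content + (x1,y1,x2,y2)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, obj_content, obj_boxes, phrase_vec):
        # obj_content: (batch, n_obj, d_content); obj_boxes: (batch, n_obj, 4)
        obj = self.project(torch.cat([obj_content, obj_boxes], dim=-1))
        obj = self.encoder(obj)                            # screen-contextual objects
        return torch.einsum("bod,bd->bo", obj, phrase_vec)  # which object to act on


# Toy usage with random inputs, just to show the shapes flowing through.
extractor, grounder = PhraseTupleExtractor(), ScreenGrounder()
tokens = torch.randint(0, 10000, (1, 20))        # one 20-token instruction
tag_logits, states = extractor(tokens)
phrase_vec = states.mean(dim=1)                  # stand-in for a pooled description span
obj_scores = grounder(torch.randn(1, 15, 64), torch.rand(1, 15, 4), phrase_vec)
print(tag_logits.shape, obj_scores.shape)        # (1, 20, 7) (1, 15)
```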
Related papers
- LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning [19.801187860991117]
This work introduces LaMP, a novel Language-Motion Pretraining model.
LaMP generates motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences.
For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model.
arXiv Detail & Related papers (2024-10-09T17:33:03Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- Training a Vision Language Model as Smartphone Assistant [1.3654846342364308]
We present a visual language model (VLM) that can fulfill diverse tasks on mobile devices.
Our model functions by interacting solely with the user interface (UI).
Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots.
arXiv Detail & Related papers (2024-04-12T18:28:44Z)
- MotionScript: Natural Language Descriptions for Expressive 3D Human Motions [8.050271017133076]
MotionScript is a motion-to-text conversion algorithm and natural language representation for human body motions.
Our experiments demonstrate that MotionScript descriptions, when applied to text-to-motion tasks, enable large language models to generate complex, previously unseen motions.
arXiv Detail & Related papers (2023-12-19T22:33:17Z)
- Android in the Wild: A Large-Scale Dataset for Android Device Control [4.973591165982018]
We present a dataset for device-control research, Android in the Wild (AITW).
The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions.
It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions.
arXiv Detail & Related papers (2023-07-19T15:57:24Z)
- Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control [58.06223121654735]
We show a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data.
Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image itself, but to the desired change between start and goal images that the instruction describes.
We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.
arXiv Detail & Related papers (2023-06-30T20:09:39Z)
- Natural Language Robot Programming: NLP integrated with autonomous robotic grasping [1.7045152415056037]
We present a grammar-based natural language framework for robot programming, specifically for pick-and-place tasks.
Our approach uses a custom dictionary of action words, designed to store together words that share meaning.
We validate our framework through simulation and real-world experimentation, using a Franka Panda robotic arm.
arXiv Detail & Related papers (2023-04-06T11:06:30Z)
- TEACH: Temporal Action Composition for 3D Humans [50.97135662063117]
Given a series of natural language descriptions, our task is to generate 3D human motions that correspond semantically to the text.
In particular, our goal is to enable the synthesis of a series of actions, which we refer to as temporal action composition.
arXiv Detail & Related papers (2022-09-09T00:33:40Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation [11.92150014766458]
We aim to fill in the last mile for embodied agents: object manipulation by following human guidance.
We build a Vision-and-Language Manipulation benchmark (VLMbench) containing various language instructions on categorized robotic manipulation tasks.
Modular rule-based task templates are created to automatically generate robot demonstrations with language instructions.
arXiv Detail & Related papers (2022-06-17T03:07:18Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
- ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments [85.81157224163876]
We combine Vision-and-Language Navigation, assembly of collected objects, and object referring expression comprehension to create a novel joint navigation-and-assembly task named ArraMon.
During this task, the agent is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment.
We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.
arXiv Detail & Related papers (2020-11-15T23:30:36Z)