Referring to Screen Texts with Voice Assistants
- URL: http://arxiv.org/abs/2306.07298v1
- Date: Sat, 10 Jun 2023 22:43:16 GMT
- Title: Referring to Screen Texts with Voice Assistants
- Authors: Shruti Bhargava, Anand Dhoot, Ing-Marie Jonsson, Hoang Long Nguyen,
Alkesh Patel, Hong Yu, Vincent Renkens
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice assistants help users make phone calls, send messages, create events,
navigate, and do a lot more. However, assistants have limited capacity to
understand their users' context. In this work, we aim to take a step in this
direction. Our work dives into a new experience for users to refer to phone
numbers, addresses, email addresses, URLs, and dates on their phone screens.
Our focus lies in reference understanding, which becomes particularly
interesting when multiple similar texts are present on screen, similar to
visual grounding. We collect a dataset and propose a lightweight
general-purpose model for this novel experience. Due to the high cost of
consuming pixels directly, our system is designed to rely on the extracted text
from the UI. Our model is modular, thus offering flexibility, improved
interpretability, and efficient runtime memory utilization.
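To make the task concrete: the assistant operates on text extracted from the UI (not pixels) and must resolve a spoken reference such as "call the second number" when several similar entities appear on screen. The following is a minimal illustrative sketch of that setting; the entity patterns, ordinal handling, and resolution logic are assumptions for illustration, not the paper's actual model.

```python
import re

# Illustrative (hypothetical) patterns for the entity types named in the
# abstract: phone numbers, email addresses, and URLs in extracted UI text.
PATTERNS = {
    "phone": re.compile(r"\(?\+?\d[\d\s().-]{7,}\d"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "url": re.compile(r"https?://\S+"),
}

ORDINALS = {"first": 0, "second": 1, "third": 2}

def extract_entities(ui_lines):
    """Scan extracted UI text top-to-bottom and collect typed entities."""
    found = []
    for line in ui_lines:
        for etype, pat in PATTERNS.items():
            for m in pat.finditer(line):
                found.append((etype, m.group()))
    return found

def resolve_reference(utterance, entities):
    """Resolve an ordinal reference like 'the second number' to one entity."""
    utterance = utterance.lower()
    etype = ("phone" if "number" in utterance else
             "email" if "email" in utterance else
             "url" if "link" in utterance or "url" in utterance else None)
    matches = [value for t, value in entities if t == etype]
    for word, idx in ORDINALS.items():
        if word in utterance and idx < len(matches):
            return matches[idx]
    return matches[0] if matches else None

screen = ["Office: (408) 555-0100", "Mobile: (650) 555-0199",
          "Mail: team@example.com"]
entities = extract_entities(screen)
print(resolve_reference("call the second number", entities))  # → (650) 555-0199
```

A rule-based sketch like this breaks down exactly where the paper's focus begins: ambiguous or paraphrased references over many similar on-screen texts, which is why the authors collect a dataset and train a lightweight model instead.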
Related papers
- Distilling an End-to-End Voice Assistant Without Instruction Training Data [53.524071162124464]
Distilled Voice Assistant (DiVA) generalizes to Question Answering, Classification, and Translation.
We show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio.
arXiv Detail & Related papers (2024-10-03T17:04:48Z)
- Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
- Training a Vision Language Model as Smartphone Assistant [1.3654846342364308]
We present a visual language model (VLM) that can fulfill diverse tasks on mobile devices.
Our model functions by interacting solely with the user interface (UI).
Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots.
arXiv Detail & Related papers (2024-04-12T18:28:44Z)
- Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following [59.997857926808116]
We introduce a semantic panel as middleware for decoding texts into images.
The panel is obtained by arranging the visual concepts parsed from the input text.
We develop a practical system and showcase its potential in continuous generation and chatting-based editing.
arXiv Detail & Related papers (2023-11-28T17:57:44Z)
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [101.50522135049198]
BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language.
Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds the corresponding masks in the image.
Our experiments show that BuboGPT achieves impressive multi-modal understanding and visual grounding abilities during interaction with humans.
arXiv Detail & Related papers (2023-07-17T15:51:47Z)
- Rewriting the Script: Adapting Text Instructions for Voice Interaction [39.54213483588498]
We study the limitations of the dominant approach voice assistants take to complex task guidance.
We propose eight ways in which voice assistants can transform written sources into forms that are readily communicated through spoken conversation.
arXiv Detail & Related papers (2023-06-16T17:43:00Z)
- Unsupervised Neural Stylistic Text Generation using Transfer learning and Adapters [66.17039929803933]
We propose a novel transfer learning framework which updates only 0.3% of model parameters to learn style-specific attributes for response generation.
We learn style-specific attributes from the PERSONALITY-CAPTIONS dataset.
arXiv Detail & Related papers (2022-10-07T00:09:22Z)
- Enabling Conversational Interaction with Mobile UI using Large Language Models [15.907868408556885]
To perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task.
This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single language model.
arXiv Detail & Related papers (2022-09-18T20:58:39Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning [34.24671403624908]
Mobile User Interface Summarization generates succinct language descriptions of mobile screens for conveying important contents and functionalities of the screen.
We present Screen2Words, a novel screen summarization approach that automatically encapsulates essential information of a UI screen into a coherent language phrase.
arXiv Detail & Related papers (2021-08-07T03:01:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.