Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
- URL: http://arxiv.org/abs/2108.03353v1
- Date: Sat, 7 Aug 2021 03:01:23 GMT
- Title: Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
- Authors: Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li
- Abstract summary: Mobile User Interface Summarization generates succinct language descriptions of mobile screens for conveying important contents and functionalities of the screen.
We present Screen2Words, a novel screen summarization approach that automatically encapsulates essential information of a UI screen into a coherent language phrase.
- Score: 34.24671403624908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mobile User Interface Summarization generates succinct language descriptions
of mobile screens for conveying important contents and functionalities of the
screen, which can be useful for many language-based application scenarios. We
present Screen2Words, a novel screen summarization approach that automatically
encapsulates essential information of a UI screen into a coherent language
phrase. Summarizing mobile screens requires a holistic understanding of the
multi-modal data of mobile UIs, including text, image, structures as well as UI
semantics, motivating our multi-modal learning approach. We collected and
analyzed a large-scale screen summarization dataset annotated by human workers.
Our dataset contains more than 112k language summarization across $\sim$22k
unique UI screens. We then experimented with a set of deep models with
different configurations. Our evaluation of these models with both automatic
accuracy metrics and human rating shows that our approach can generate
high-quality summaries for mobile screens. We demonstrate potential use cases
of Screen2Words and open-source our dataset and model to lay the foundations
for further bridging language and user interfaces.
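As a concrete illustration of the multi-modal setup described in the abstract, the sketch below fuses three input streams (on-screen text, the screenshot image, and view-hierarchy element types) in a shared Transformer encoder and decodes a summary phrase. It is a minimal, assumption-laden example: the `ScreenSummarizer` class, its dimensions, and the pooled-image encoder are hypothetical placeholders, not the authors' released Screen2Words model.

```python
# Minimal sketch of a multimodal screen-summarization encoder-decoder in PyTorch.
# Module names, dimensions, and the pooled-image encoder are illustrative assumptions,
# not the released Screen2Words implementation.
import torch
import torch.nn as nn


class ScreenSummarizer(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, n_heads=4, n_layers=2,
                 num_element_types=32):
        super().__init__()
        # Screen text (e.g., view-hierarchy strings), already tokenized to ids.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Screenshot pixels -> one global image token (stand-in for a CNN backbone).
        self.image_proj = nn.Sequential(
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(), nn.Linear(3 * 8 * 8, d_model)
        )
        # Structural/semantic signal: one type id per UI element in the hierarchy.
        self.element_embed = nn.Embedding(num_element_types, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers
        )
        self.summary_embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, screenshot, element_types, summary_ids):
        # Fuse the three modalities into one token sequence and encode it jointly.
        text_tok = self.text_embed(text_ids)                    # (B, T_text, D)
        img_tok = self.image_proj(screenshot).unsqueeze(1)      # (B, 1, D)
        elem_tok = self.element_embed(element_types)            # (B, T_elem, D)
        memory = self.encoder(torch.cat([img_tok, text_tok, elem_tok], dim=1))
        # Decode the summary autoregressively (teacher forcing during training).
        length = summary_ids.size(1)
        causal_mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        hidden = self.decoder(self.summary_embed(summary_ids), memory, tgt_mask=causal_mask)
        return self.out(hidden)                                 # (B, T_summary, vocab)


if __name__ == "__main__":
    model = ScreenSummarizer()
    logits = model(
        text_ids=torch.randint(0, 10000, (2, 20)),      # tokens of on-screen text
        screenshot=torch.rand(2, 3, 256, 128),           # RGB screenshots
        element_types=torch.randint(0, 32, (2, 15)),     # UI element type ids
        summary_ids=torch.randint(0, 10000, (2, 12)),    # target summary tokens
    )
    print(logits.shape)  # torch.Size([2, 12, 10000])
```

In training, such logits would be scored against the human-written summaries from the dataset with standard cross-entropy; the dataset and model released with the paper are the authoritative reference.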
Related papers
- Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs).
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z) - Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z) - Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset [50.09271028495819]
multimodal research related to touch focuses on visual and tactile modalities.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
arXiv Detail & Related papers (2024-03-14T19:01:54Z) - ScreenAI: A Vision-Language Model for UI and Infographics Understanding [4.914575630736291]
We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding.
At the heart of its training mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements.
We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale.
arXiv Detail & Related papers (2024-02-07T06:42:33Z) - UReader: Universal OCR-free Visually-situated Language Understanding
with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on a Multimodal Large Language Model (MLLM).
By leveraging the shallow text recognition ability of the MLLM, we fine-tune only 1.2% of the parameters.
Our single model achieves state-of-the-art OCR-free performance on 8 out of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z) - ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine
Conversations [13.939350184164017]
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language.
We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM).
We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z) - TextMI: Textualize Multimodal Information for Integrating Non-verbal
Cues in Pre-trained Language Models [5.668457303716451]
We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks.
Our approach significantly reduces model complexity, adds interpretability to the model's decision, and can be applied for a diverse set of tasks.
arXiv Detail & Related papers (2023-03-27T17:54:32Z) - Enabling Conversational Interaction with Mobile UI using Large Language
Models [15.907868408556885]
To perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task.
This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single language model.
arXiv Detail & Related papers (2022-09-18T20:58:39Z) - Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z) - Widget Captioning: Generating Natural Language Description for Mobile
User Interface Elements [17.383434668094075]
We propose widget captioning, a novel task for automatically generating language descriptions for user interface elements.
Our dataset contains 162,859 language phrases created by human workers for annotating 61,285 UI elements.
arXiv Detail & Related papers (2020-10-08T22:56:03Z)