Understanding Mobile GUI: from Pixel-Words to Screen-Sentences
- URL: http://arxiv.org/abs/2105.11941v1
- Date: Tue, 25 May 2021 13:45:54 GMT
- Title: Understanding Mobile GUI: from Pixel-Words to Screen-Sentences
- Authors: Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Sam Yang and
Grayson Hilliard
- Abstract summary: We propose a mobile GUI understanding architecture: Pixel-Words to Screen-Sentence (PW2SS)
Pixel-Words are defined as atomic visual components, which are visually consistent and semantically clear across screenshots.
We are able to make use of metadata available in training data to auto-generate high-quality annotations for Pixel-Words.
- Score: 48.97215653702567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ubiquity of mobile phones makes mobile GUI understanding an important
task. Most previous works in this domain require human-created metadata of
screens (e.g. View Hierarchy) during inference, which unfortunately is often
not available or reliable enough for GUI understanding. Inspired by the
impressive success of Transformers in NLP tasks, and targeting purely
vision-based GUI understanding, we extend the concepts of Words/Sentence to
Pixel-Words/Screen-Sentence and propose a mobile GUI understanding
architecture: Pixel-Words to Screen-Sentence (PW2SS). By analogy with
individual Words, we define Pixel-Words as atomic visual components (text
and graphic components), which are visually consistent and semantically clear
across screenshots of a large variety of design styles. The Pixel-Words
extracted from a screenshot are aggregated into Screen-Sentence with a Screen
Transformer proposed to model their relations. Since the Pixel-Words are
defined as atomic visual components, the ambiguity between their visual
appearance and semantics is dramatically reduced. We are able to make use of
metadata available in training data to auto-generate high-quality annotations
for Pixel-Words. A dataset of screenshots with Pixel-Words annotations,
RICO-PW, is built on top of the public RICO dataset and will be released to
help address the lack of high-quality training data in this area. We
train a detector to extract Pixel-Words from screenshots on this dataset and
achieve metadata-free GUI understanding during inference. Experiments show
that Pixel-Words can be extracted well on RICO-PW and generalize well to a new
dataset, P2S-UI, which we collected ourselves. The effectiveness
of PW2SS is further verified in the GUI understanding tasks including relation
prediction, clickability prediction, screen retrieval, and app type
classification.
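To make the pipeline above concrete, here is a minimal sketch of how metadata-free inference could be organized: a detector extracts Pixel-Word embeddings from a raw screenshot, and a Screen Transformer aggregates them into a Screen-Sentence used for downstream predictions such as clickability and app type. This is an illustrative sketch assuming a PyTorch setup; the module names, dimensions, heads, and the number of app categories are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of the PW2SS inference stage (not the authors' code).
# Detected Pixel-Words are aggregated into a Screen-Sentence representation
# that serves screen-level and component-level tasks.
import torch
import torch.nn as nn


class ScreenTransformer(nn.Module):
    """Aggregates Pixel-Word embeddings into a Screen-Sentence representation."""

    def __init__(self, dim=256, heads=8, layers=6, num_app_types=27):
        super().__init__()
        # Learnable screen-level token, analogous to a [CLS] token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.clickability_head = nn.Linear(dim, 1)          # per Pixel-Word
        self.app_type_head = nn.Linear(dim, num_app_types)  # per screen (placeholder count)

    def forward(self, pixel_word_embeds):
        # pixel_word_embeds: (batch, num_pixel_words, dim), e.g. detector features
        # combined with a projection of each component's box and type.
        batch = pixel_word_embeds.size(0)
        cls = self.cls_token.expand(batch, -1, -1)
        tokens = torch.cat([cls, pixel_word_embeds], dim=1)
        encoded = self.encoder(tokens)        # Screen-Sentence tokens
        screen_repr = encoded[:, 0]           # screen-level summary
        word_repr = encoded[:, 1:]            # contextualized Pixel-Words
        return {
            "clickability_logits": self.clickability_head(word_repr).squeeze(-1),
            "app_type_logits": self.app_type_head(screen_repr),
        }


# Usage with random embeddings standing in for real detector output:
embeds = torch.randn(2, 40, 256)   # 2 screenshots, 40 detected Pixel-Words each
outputs = ScreenTransformer()(embeds)
```

The screen-level token gives a single vector for screen retrieval and app type classification, while the contextualized Pixel-Word tokens support per-component tasks such as clickability and relation prediction.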
Related papers
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z) - From Pixels to Prose: A Large Dataset of Dense Image Captions [76.97493750144812]
PixelProse is a comprehensive dataset of over 16M (million) synthetically generated captions.
To ensure data integrity, we rigorously analyze our dataset for problematic content.
We also provide valuable metadata such as watermark presence and aesthetic scores.
arXiv Detail & Related papers (2024-06-14T17:59:53Z) - IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks [124.90137528319273]
In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts.
We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions.
During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output.
arXiv Detail & Related papers (2023-12-04T09:48:29Z) - Context Does Matter: End-to-end Panoptic Narrative Grounding with
Deformable Attention Refined Matching Network [25.511804582983977]
Panoptic Narrative Grounding (PNG) aims to segment visual objects in images based on dense narrative captions.
We propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN)
DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels.
arXiv Detail & Related papers (2023-10-25T13:12:39Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a minimal sketch of this setup appears after this list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z) - Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal
Transformers [46.275416873403614]
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding.
Our approach achieves state-of-the-art results in downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR).
arXiv Detail & Related papers (2020-04-02T07:39:28Z)
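As referenced in the entry on noisy text supervision above, here is a minimal sketch of a dual-encoder contrastive alignment step. It is an illustrative, hedged example of the general technique (symmetric contrastive loss over image-text pairs), not the ALIGN authors' code; the function name, embedding size, and temperature value are assumptions.

```python
# Hypothetical sketch of dual-encoder contrastive alignment: image and text
# embeddings are L2-normalized, and the matching pairs on the diagonal of the
# similarity matrix are treated as positives for a symmetric cross-entropy loss.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_embeds, text_embeds, temperature=0.07):
    # image_embeds, text_embeds: (batch, dim) outputs of two separate encoders
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature  # pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Match each image to its caption and each caption to its image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Usage with random embeddings standing in for encoder outputs:
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```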