UI Layers Group Detector: Grouping UI Layers via Text Fusion and Box
Attention
- URL: http://arxiv.org/abs/2212.03440v1
- Date: Wed, 7 Dec 2022 03:50:20 GMT
- Title: UI Layers Group Detector: Grouping UI Layers via Text Fusion and Box
Attention
- Authors: Shuhong Xiao, Tingting Zhou, Yunnong Chen, Dengming Zhang, Liuqing
Chen, Lingyun Sun, Shiyu Yue
- Abstract summary: We propose a vision-based method that automatically detects image layers (i.e., basic shapes and visual elements) and text layers that convey the same semantic meaning.
We construct a large-scale UI dataset for training and testing, and present a data augmentation approach to boost the detection performance.
- Score: 7.614630088064978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graphical User Interfaces (GUIs) are in great demand with the popularization
and prosperity of mobile apps. Automatic UI code generation from UI design
drafts dramatically simplifies the development process. However, the nested
layer structure in a design draft affects the quality and usability of the
generated code, and few existing automated GUI techniques detect and group the
nested layers to improve the accessibility of the generated code. In this paper, we
propose the UI Layers Group Detector, a vision-based method that
automatically detects image layers (i.e., basic shapes and visual elements) and text
layers that convey the same semantic meaning. We propose two plug-in
components, text fusion and box attention, that utilize text information from
design drafts as prior information for group localization. We construct a
large-scale UI dataset for training and testing, and present a data
augmentation approach to boost detection performance. Experiments show
that the proposed method achieves decent accuracy in layer grouping.
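The abstract describes box attention and text fusion only at a high level, so the sketch below is not the paper's implementation; it is a minimal PyTorch illustration of one plausible reading of the box-attention idea, in which the text-layer bounding boxes already available in a design draft are rasterized into a binary mask and used to gate the detector's image features. All names here (rasterize_text_boxes, BoxAttentionFusion, feat_channels) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch (not the paper's code): use text-layer boxes from a
# design draft as a spatial prior for the group detector.

def rasterize_text_boxes(boxes, height, width):
    """Render normalized (x1, y1, x2, y2) text-layer boxes into a 1-channel mask."""
    mask = torch.zeros(1, height, width)
    for x1, y1, x2, y2 in boxes:
        r1, r2 = int(y1 * height), int(y2 * height)
        c1, c2 = int(x1 * width), int(x2 * width)
        mask[0, r1:r2, c1:c2] = 1.0
    return mask

class BoxAttentionFusion(nn.Module):
    """Gate backbone features with an attention map derived from the box mask."""
    def __init__(self, feat_channels):
        super().__init__()
        self.mask_proj = nn.Sequential(
            nn.Conv2d(1, feat_channels, kernel_size=3, padding=1),
            nn.Sigmoid(),  # per-pixel, per-channel attention weights in (0, 1)
        )

    def forward(self, feats, box_mask):
        # Resize the mask to the feature resolution, then apply residual gating
        # so regions covered by text boxes are emphasized without discarding
        # the original signal elsewhere.
        mask = F.interpolate(box_mask, size=feats.shape[-2:], mode="nearest")
        attn = self.mask_proj(mask)
        return feats + feats * attn

# Toy usage with random backbone features of shape (batch, channels, H, W).
feats = torch.randn(2, 256, 64, 64)
boxes = [(0.10, 0.10, 0.40, 0.18), (0.55, 0.60, 0.90, 0.68)]
box_mask = rasterize_text_boxes(boxes, 512, 512).unsqueeze(0).repeat(2, 1, 1, 1)
fused = BoxAttentionFusion(256)(feats, box_mask)
print(fused.shape)  # torch.Size([2, 256, 64, 64])
```

A text-fusion component would presumably inject the text content itself (e.g., embeddings of the strings in each text layer) alongside this spatial prior; see the paper for the actual design of both components.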
Related papers
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model for the digital world, ShowUI, which features several innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection [57.883265488038134]
We propose a hierarchical graph interaction network termed HGINet for camouflaged object detection.
The network is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features.
Our experiments demonstrate the superior performance of HGINet compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-27T12:53:25Z)
- Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
- UI Semantic Group Detection: Grouping UI Elements with Similar Semantics in Mobile Graphical User Interface [10.80156450091773]
Existing studies on UI element grouping mainly focus on a single UI-related software engineering task, and their groups vary in appearance and function.
We propose our semantic component groups that pack adjacent text and non-text elements with similar semantics.
To recognize semantic component groups on a UI page, we propose a robust, deep learning-based vision detector, UISCGD.
arXiv Detail & Related papers (2024-03-08T01:52:44Z)
- EGFE: End-to-end Grouping of Fragmented Elements in UI Designs with Multimodal Learning [10.885275494978478]
Grouping fragmented elements can greatly improve the readability and maintainability of the generated code.
Current methods employ a two-stage strategy that introduces hand-crafted rules to group fragmented elements.
We propose EGFE, a novel method for automatic End-to-end Grouping of Fragmented Elements via UI sequence prediction.
arXiv Detail & Related papers (2023-09-18T15:28:12Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Adaptively Clustering Neighbor Elements for Image-Text Generation [78.82346492527425]
We propose a novel Transformer-based image-to-text generation model termed ACF.
ACF adaptively clusters vision patches into object regions and language words into phrases to implicitly learn object-phrase alignments.
Experiment results demonstrate the effectiveness of ACF, which outperforms most SOTA captioning and VQA models.
arXiv Detail & Related papers (2023-01-05T08:37:36Z)
- ULDGNN: A Fragmented UI Layer Detector Based on Graph Neural Networks [7.614630088064978]
Fragmented layers can degrade the quality of generated code if they are all involved in code generation without being merged into a whole.
In this paper, we propose a pipeline to merge fragmented layers automatically.
Our approach can retrieve most fragmented layers in UI design drafts, and achieve 87% accuracy in the detection task.
arXiv Detail & Related papers (2022-08-13T14:14:37Z)
- UI Layers Merger: Merging UI Layers via Visual Learning and Boundary Prior [7.251022347055101]
Fragmented layers inevitably appear in UI design drafts, which greatly reduces the quality of code generation.
We propose UI Layers Merger (UILM), a vision-based method, which can automatically detect and merge fragmented layers into UI components.
arXiv Detail & Related papers (2022-06-18T16:09:28Z)
- VINS: Visual Search for Mobile User Interface Design [66.28088601689069]
This paper introduces VINS, a visual search framework that takes a UI image as input and retrieves visually similar design examples.
The framework achieves a mean Average Precision of 76.39% for UI detection and high performance in querying similar UI designs.
arXiv Detail & Related papers (2021-02-10T01:46:33Z)