Related papers: ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations

ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations

URL: http://arxiv.org/abs/2310.04869v1
Date: Sat, 7 Oct 2023 16:32:34 GMT
Title: ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations
Authors: Yue Jiang, Eldon Schoop, Amanda Swearngin, Jeffrey Nichols
Abstract summary: Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language. We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM) We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
Score: 13.939350184164017
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to multi-step UI navigation and planning.

Related papers

ABC: Achieving Better Control of Multimodal Embeddings using VLMs [61.396457715710774]
Visual embedding models excel at zero-shot tasks like visual retrieval and classification. Existing CLIP-based approaches embed images and text independently, and fuse the result. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone.
arXiv Detail & Related papers (2025-03-01T03:29:02Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.<n>We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.<n>We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs [54.58905728115257]
We propose the methodname pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing the UI content changes before and after simulated interactions with specific UI elements. We construct an methodname-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations that have never been provided by previous datasets.
arXiv Detail & Related papers (2025-02-04T03:39:59Z)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations. ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs) These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z)
Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations. Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken. We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API [17.991044940694778]
We build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor. To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm. Our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-10-07T07:22:41Z)
Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus [9.401663915424008]
We propose a vision-language model that only takes the screenshot of the UI and a region of interest on the screen as the input. Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods.
arXiv Detail & Related papers (2022-09-29T16:45:43Z)
Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling. With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling. We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
UIBert: Learning Generic Multimodal Representations for UI Understanding [12.931540149350633]
We introduce a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data. Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components, are predictive of each other. We propose five pretraining tasks utilizing this self-alignment among different features of a UI component and across various components in the same UI. We evaluate our method on nine real-world downstream UI tasks where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy.
arXiv Detail & Related papers (2021-07-29T03:51:36Z)
ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces [12.52699475631247]
We introduce a new pre-trained UI representation model called ActionBert. Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components. Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
arXiv Detail & Related papers (2020-12-22T20:49:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.