Leveraging Multimodal LLM for Inspirational User Interface Search
- URL: http://arxiv.org/abs/2501.17799v3
- Date: Sat, 15 Feb 2025 06:31:08 GMT
- Title: Leveraging Multimodal LLM for Inspirational User Interface Search
- Authors: Seokhyeon Park, Yumin Song, Soohyun Lee, Jaeyoung Kim, Jinwook Seo
- Abstract summary: Existing AI-based UI search methods often miss crucial semantics like target users or the mood of apps.
We used a multimodal large language model (MLLM) to extract and interpret semantics from mobile UI images.
Our approach significantly outperforms existing UI retrieval methods, offering UI designers a more enriched and contextually relevant search experience.
- Score: 12.470067381902972
- Abstract: Inspirational search, the process of exploring designs to inform and inspire new creative work, is pivotal in mobile user interface (UI) design. However, exploring the vast space of UI references remains a challenge. Existing AI-based UI search methods often miss crucial semantics like target users or the mood of apps. Additionally, these models typically require metadata like view hierarchies, limiting their practical use. We used a multimodal large language model (MLLM) to extract and interpret semantics from mobile UI images. We identified key UI semantics through a formative study and developed a semantic-based UI search system. Through computational and human evaluations, we demonstrate that our approach significantly outperforms existing UI retrieval methods, offering UI designers a more enriched and contextually relevant search experience. We enhance the understanding of mobile UI design semantics and highlight MLLMs' potential in inspirational search, providing a rich dataset of UI semantics for future studies.
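As a rough illustration of the kind of pipeline the abstract describes, the sketch below extracts a textual semantic description from a UI screenshot and ranks a corpus by embedding similarity. Everything here is assumed for illustration: `describe_ui` stands in for an actual MLLM call and the hashed bag-of-words embedder is a placeholder for a learned text encoder; this is not the authors' implementation.
```python
# Minimal sketch of a semantic-based UI search flow, assuming a hypothetical
# MLLM helper (`describe_ui`) and a stand-in text embedder.
from dataclasses import dataclass
import hashlib
import numpy as np

SEMANTIC_PROMPT = (
    "Describe this mobile UI screenshot: app category, target users, "
    "mood/tone, visual style, and main functionality."
)

def describe_ui(image_path: str) -> str:
    """Placeholder for an MLLM call that turns a screenshot into UI semantics."""
    # A real system would send the image plus SEMANTIC_PROMPT to a multimodal LLM.
    return f"semantics extracted from {image_path}"

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in text embedder: hashed bag-of-words, unit-normalized."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

@dataclass
class UIEntry:
    image_path: str
    semantics: str
    vector: np.ndarray

def build_index(image_paths: list[str]) -> list[UIEntry]:
    entries = []
    for path in image_paths:
        semantics = describe_ui(path)
        entries.append(UIEntry(path, semantics, embed(semantics)))
    return entries

def search(query: str, index: list[UIEntry], k: int = 5) -> list[tuple[float, str]]:
    q = embed(query)
    ranked = sorted(((float(q @ e.vector), e.image_path) for e in index), reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    index = build_index(["ui/meditation_home.png", "ui/banking_login.png"])
    print(search("calm mindfulness app for beginners", index))
```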
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
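A toy sketch of the multi-granularity token compression mentioned for the Matryoshka Multimodal Embedder (MME) above: average-pooling a patch-token grid into progressively coarser grids. This is an illustrative stand-in, not the MME architecture from the paper.
```python
# Toy illustration of compressing visual tokens at multiple granularities
# via simple average pooling; not the MME architecture itself.
import numpy as np

def pool_tokens(tokens: np.ndarray, out_side: int) -> np.ndarray:
    """Average-pool a (side*side, dim) patch-token grid down to (out_side*out_side, dim)."""
    side = int(np.sqrt(tokens.shape[0]))
    dim = tokens.shape[1]
    factor = side // out_side
    grid = tokens.reshape(out_side, factor, out_side, factor, dim)
    return grid.mean(axis=(1, 3)).reshape(out_side * out_side, dim)

# 16x16 patch tokens with 64-dim features, compressed into nested granularities.
visual_tokens = np.random.randn(256, 64)
for side in (16, 8, 4, 1):
    print(f"{side * side:4d} tokens ->", pool_tokens(visual_tokens, side).shape)
```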
- LLM-assisted Explicit and Implicit Multi-interest Learning Framework for Sequential Recommendation [50.98046887582194]
We propose an explicit and implicit multi-interest learning framework to model user interests on two levels: behavior and semantics.
The proposed EIMF framework effectively and efficiently combines small models with LLM to improve the accuracy of multi-interest modeling.
arXiv Detail & Related papers (2024-11-14T13:00:23Z)
- Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs).
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z)
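The MultiUI entry above describes synthesizing instructions from webpage UI text with a text-only LLM and pairing them with screenshots. A minimal sketch of that flow, with `call_llm` as a placeholder for any chat-completion API and an invented output schema, might look like this; it is not the paper's pipeline.
```python
# Rough sketch of synthesizing multimodal instructions from webpage UI text
# with a text-only LLM, then pairing them with screenshots for training.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a text-only LLM call; substitute a real API client here."""
    return json.dumps([{"question": "What does the primary button do?",
                        "answer": "It submits the sign-up form."}])

def synthesize_instructions(accessibility_text: str, screenshot_path: str) -> list[dict]:
    prompt = (
        "Given this webpage accessibility text, write question-answer pairs "
        "about the page's UI as a JSON list:\n" + accessibility_text
    )
    qa_pairs = json.loads(call_llm(prompt))
    # Pair each generated instruction with the screenshot for multimodal training.
    return [{"image": screenshot_path, **qa} for qa in qa_pairs]

print(synthesize_instructions("button: Sign up; field: Email", "shots/signup.png"))
```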
- Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and the local action taken on it.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
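To make the Textual Foresight objective above concrete, one training example pairs a current UI and a local action with a description of the resulting screen. The field names below are illustrative, not the OpenApp dataset schema.
```python
# Sketch of how a Textual Foresight-style pretraining example might be laid out:
# the model sees the current UI and a local action, and must produce a global
# description of the resulting screen.
from dataclasses import dataclass

@dataclass
class ForesightExample:
    current_screenshot: str   # path to the current UI image
    action: str               # local action taken on that UI
    future_description: str   # target: global description of the next UI state

example = ForesightExample(
    current_screenshot="screens/product_page.png",
    action="tap 'Add to cart'",
    future_description="A cart screen listing the added item with checkout options.",
)
print(example)
```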
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- UI Layout Generation with LLMs Guided by UI Grammar [13.172638190095395]
Large Language Models (LLMs) have stimulated interest among researchers and industry professionals.
This paper introduces UI grammar -- a novel approach to represent the hierarchical structure inherent in UI screens.
The aim of this approach is to guide the generative capacities of LLMs more effectively and improve the explainability and controllability of the process.
arXiv Detail & Related papers (2023-10-24T02:00:12Z)
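A toy illustration of the UI-grammar idea above: hierarchical screen structure expressed as production rules that could constrain or validate LLM-generated layouts. The rules here are invented for illustration and are not the grammar proposed in the paper.
```python
# Toy grammar describing UI hierarchy; symbols without rules are terminals.
import random

UI_GRAMMAR = {
    "Screen":  [["Toolbar", "Content"], ["Content", "BottomNav"]],
    "Content": [["List"], ["Card", "Card"], ["Image", "Text", "Button"]],
    "List":    [["ListItem", "ListItem", "ListItem"]],
}

def expand(symbol: str) -> dict:
    rules = UI_GRAMMAR.get(symbol)
    if not rules:  # terminal component
        return {"type": symbol}
    production = random.choice(rules)
    return {"type": symbol, "children": [expand(child) for child in production]}

print(expand("Screen"))
```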
- ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations [13.939350184164017]
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language.
We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM).
We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z)
- Towards Better Semantic Understanding of Mobile Interfaces [7.756895821262432]
We release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements.
This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs.
We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset.
arXiv Detail & Related papers (2022-10-06T03:48:54Z)
- Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus [9.401663915424008]
We propose a vision-language model that only takes the screenshot of the UI and a region of interest on the screen as the input.
Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods.
arXiv Detail & Related papers (2022-09-29T16:45:43Z)
- VINS: Visual Search for Mobile User Interface Design [66.28088601689069]
This paper introduces VINS, a visual search framework that takes a UI image as input and retrieves visually similar design examples.
The framework achieves a mean Average Precision of 76.39% for the UI detection and high performance in querying similar UI designs.
arXiv Detail & Related papers (2021-02-10T01:46:33Z)
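In contrast to the semantic search sketched earlier, the VINS entry above describes purely visual retrieval. The sketch below ranks screenshots by cosine similarity over a coarse color-grid feature, which is only a stand-in for the paper's learned detection and embedding pipeline.
```python
# Minimal sketch of visual (rather than semantic) UI retrieval: rank screenshots
# by cosine similarity over a coarse color-grid feature.
import numpy as np

def color_grid_feature(image: np.ndarray, grid: int = 8) -> np.ndarray:
    """Downsample an (H, W, 3) screenshot to a grid of mean colors, flattened and normalized."""
    h, w, _ = image.shape
    cells = [
        image[i * h // grid:(i + 1) * h // grid,
              j * w // grid:(j + 1) * w // grid].mean(axis=(0, 1))
        for i in range(grid) for j in range(grid)
    ]
    vec = np.concatenate(cells)
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve(query: np.ndarray, corpus: dict[str, np.ndarray], k: int = 3) -> list[str]:
    q = color_grid_feature(query)
    sims = {name: float(q @ color_grid_feature(img)) for name, img in corpus.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# Random arrays stand in for decoded screenshots.
corpus = {f"ui_{i}.png": np.random.rand(640, 360, 3) for i in range(5)}
print(retrieve(np.random.rand(640, 360, 3), corpus))
```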
- ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces [12.52699475631247]
We introduce a new pre-trained UI representation model called ActionBert.
Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components.
Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
arXiv Detail & Related papers (2020-12-22T20:49:52Z)