UI-UG: A Unified MLLM for UI Understanding and Generation
- URL: http://arxiv.org/abs/2509.24361v2
- Date: Tue, 30 Sep 2025 07:45:11 GMT
- Title: UI-UG: A Unified MLLM for UI Understanding and Generation
- Authors: Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, Hai Rao,
- Abstract summary: We introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs.
- Score: 19.7078650905834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they still face challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on modern, complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly-sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks. Code and Model: https://github.com/neovateai/UI-UG
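The training recipe in the abstract pairs GRPO (advantages computed group-relatively over sampled responses, with no learned value network) with DPO (a preference margin measured against a frozen reference policy). A minimal sketch of the two objectives, assuming scalar rewards and summed sequence log-probabilities; the function names and the beta=0.1 default are illustrative, not taken from the paper:

```python
import math

def grpo_advantages(rewards):
    """GRPO core idea: normalize each sampled response's reward against
    its own group's mean and standard deviation, so no value network
    is needed to estimate a baseline."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std or 1.0) for r in rewards]

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO core idea: increase the policy's log-prob margin for the
    human-preferred (chosen) UI over the rejected one, relative to a
    frozen reference policy."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

With equal chosen/rejected margins the DPO loss sits at ln 2, and it shrinks as the policy prefers the chosen UI more strongly than the reference does; a group with rewards [1, 0] normalizes to advantages [1, -1].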
Related papers
- ThinkGen: Generalized Thinking for Visual Generation [97.19923474851987]
ThinkGen is a think-driven visual generation framework that explicitly leverages Chain-of-Thought (CoT) reasoning in various generation scenarios. We propose a separable GRPO-based training paradigm, alternating reinforcement learning between the MLLM and DiT modules. Experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks.
arXiv Detail & Related papers (2025-12-29T16:08:50Z)
- From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents [21.811753076804944]
Large Language Model (LLM) agents show great potential for automated UI navigation, such as automated UI testing and AI assistants, yet their efficiency has been largely overlooked. We present UIFormer, the first automated optimization framework that synthesizes UI transformation programs via constraint-based optimization.
arXiv Detail & Related papers (2025-12-15T15:34:06Z)
- AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent [21.148033135113927]
We introduce an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation. We propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings. We evaluate AFRAgent on the Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
arXiv Detail & Related papers (2025-11-30T11:32:54Z)
- Structuring GUI Elements through Vision Language Models: Towards Action Space Generation [43.932266242034025]
Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. This paper focuses on the application of MLLMs to structuring graphical user interface (GUI) elements. We introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm to bolster visual module capabilities.
arXiv Detail & Related papers (2025-08-22T10:14:15Z)
- API Agents vs. GUI Agents: Divergence and Convergence [37.13923771130588]
Both API-based and GUI-based large language model (LLM) agents automate software, the latter interacting with graphical user interfaces in a human-like manner. This paper systematically analyzes their divergence and potential convergence. We indicate that continuing innovations in LLM-based automation are poised to blur the lines between API- and GUI-driven agents.
arXiv Detail & Related papers (2025-03-14T04:26:21Z)
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals. Recent studies either develop tool-using approaches or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. We propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)
- ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations [13.939350184164017]
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language.
We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM).
We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z)
- Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations [53.76682562935373]
We introduce an efficient framework called InteRecAgent, which employs LLMs as the brain and recommender models as tools.
InteRecAgent achieves satisfactory performance as a conversational recommender system, outperforming general-purpose LLMs.
arXiv Detail & Related papers (2023-08-31T07:36:44Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.