UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity
- URL: http://arxiv.org/abs/2409.04081v3
- Date: Wed, 2 Oct 2024 05:00:57 GMT
- Title: UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity
- Authors: Yicheng Fu, Raviteja Anantha, Prabal Vashisht, Jianpeng Cheng, Etai Littwin,
- Abstract summary: UI-JEPA is a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data.
We introduce two new multimodal datasets, "Intent in the Wild" (IIW) and "Intent in the Tame" (IIT).
IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories.
- Score: 7.299239909796724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters and computing power, together with their high latency, make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, "Intent in the Wild" (IIW) and "Intent in the Tame" (IIT), designed for few-shot and zero-shot UI understanding tasks. IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories. We establish the first baselines for these datasets, showing that representations learned using a JEPA-style objective, combined with an LLM decoder, can achieve user intent predictions that match the performance of state-of-the-art large MLLMs, but with significantly reduced annotation and deployment resources. Measured by intent similarity scores, UI-JEPA outperforms GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged across two datasets. Notably, UI-JEPA achieves this performance with a 50.5x reduction in computational cost and a 6.6x improvement in latency on the IIW dataset. These results underscore the effectiveness of UI-JEPA, highlighting its potential for lightweight, high-performance UI understanding.
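As an illustration of the JEPA-style training signal the abstract describes, the minimal PyTorch sketch below predicts the embeddings of masked UI frames from the visible ones. The encoder sizes, the masking scheme, and the separate target encoder are illustrative assumptions rather than UI-JEPA's exact architecture, and the LLM decoder stage is omitted.

```python
# Minimal sketch of a JEPA-style objective over UI frames (shapes, masking, and the
# separate target encoder are assumptions for illustration, not the paper's design).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy frame encoder: maps per-frame features to D-dimensional embeddings."""
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):           # x: (batch, frames, in_dim)
        return self.net(x)

def jepa_loss(context_enc, target_enc, predictor, frames, mask):
    """Predict embeddings of masked frames from the visible (context) frames.

    frames: (B, T, in_dim) raw per-frame features
    mask:   (B, T) bool, True where the frame is hidden from the context encoder
    """
    ctx = context_enc(frames * (~mask).unsqueeze(-1))   # context encoder sees only visible frames
    with torch.no_grad():                                # target encoder provides regression targets
        tgt = target_enc(frames)
    pred = predictor(ctx)                                # prediction happens in embedding space
    diff = (pred - tgt) ** 2                             # no pixel-level reconstruction
    return (diff.mean(-1) * mask).sum() / mask.sum().clamp(min=1)

# Usage: an LLM decoder fine-tuned on the learned embeddings would follow (not shown).
context_enc, target_enc = Encoder(), Encoder()
target_enc.load_state_dict(context_enc.state_dict())
predictor = nn.Linear(256, 256)
frames = torch.randn(2, 8, 768)
mask = torch.rand(2, 8) < 0.5
loss = jepa_loss(context_enc, target_enc, predictor, frames, mask)
loss.backward()
```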
Related papers
- UI-Venus-1.5 Technical Report [64.4832043785725]
We present UI-Venus-1.5, a unified, end-to-end GUI agent. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B). In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps.
arXiv Detail & Related papers (2026-02-09T18:43:40Z) - From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents [21.811753076804944]
While Large Language Model (LLM) agents show great potential for automated UI navigation, such as automated UI testing and AI assistants, their efficiency has been largely overlooked. We present UIFormer, the first automated optimization framework that synthesizes UI transformation programs by conducting constraint-based optimization.
arXiv Detail & Related papers (2025-12-15T15:34:06Z) - PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents [151.86841216364294]
We propose PAL-UI (Planning with Active Look-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool.
arXiv Detail & Related papers (2025-10-01T01:48:39Z) - UI-UG: A Unified MLLM for UI Understanding and Generation [19.7078650905834]
We introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs.
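Of the techniques listed above, DPO is the most self-contained; the sketch below shows the standard DPO loss in its commonly used form, not UI-UG's specific training setup, and the beta value and batch interface are illustrative assumptions.

```python
# Hedged sketch of the standard Direct Preference Optimization (DPO) loss.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Push the policy to prefer the human-preferred UI over the rejected one,
    relative to a frozen reference model.

    All arguments are summed sequence log-probabilities with shape (batch,).
    """
    chosen_ratio = logp_chosen - ref_logp_chosen        # log-ratio for the preferred sample
    rejected_ratio = logp_rejected - ref_logp_rejected  # log-ratio for the rejected sample
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example with toy log-probabilities.
lp_c, lp_r = torch.tensor([-10.0]), torch.tensor([-12.0])
rp_c, rp_r = torch.tensor([-11.0]), torch.tensor([-11.5])
print(dpo_loss(lp_c, lp_r, rp_c, rp_r))
```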
arXiv Detail & Related papers (2025-09-29T06:59:09Z) - UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z) - UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis [15.429065788185522]
We introduce UI-E2I-Synth, a large-scale data synthesis pipeline for generating instruction datasets of varying complexity.
We propose a new GUI instruction grounding benchmark UI-I2E-Bench, which is designed to address the limitations of existing benchmarks.
Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding.
arXiv Detail & Related papers (2025-04-15T14:56:21Z) - Breaking the Data Barrier -- Building GUI Agents Through Task Generalization [25.129269032612832]
We propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage.
We explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning.
Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges.
arXiv Detail & Related papers (2025-04-14T11:35:02Z) - UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [31.796328505473305]
We propose UI-R1, the first framework to explore how rule-based reinforcement learning can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphical user interface (GUI) action prediction tasks.
Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO).
For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices.
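To make the idea of a rule-based action reward concrete, here is a hedged sketch; the action schema, reward weights, and pixel tolerance are illustrative assumptions, not UI-R1's actual reward design.

```python
# Hedged sketch of a rule-based GUI action reward of the kind UI-R1 describes.
def action_reward(pred, gold, click_tolerance=14):
    """Score a predicted GUI action against a reference action.

    pred / gold: dicts like {"type": "click", "x": 120, "y": 340} or
    {"type": "type_text", "text": "hello"}. Returns a scalar reward usable by a
    policy-gradient method such as GRPO.
    """
    if pred.get("type") != gold.get("type"):
        return 0.0                                   # wrong action type earns nothing
    reward = 0.5                                     # correct action type
    if gold["type"] == "click":
        # coordinate term: full credit if the click lands within a pixel tolerance
        hit = (abs(pred["x"] - gold["x"]) <= click_tolerance and
               abs(pred["y"] - gold["y"]) <= click_tolerance)
        reward += 0.5 if hit else 0.0
    elif gold["type"] == "type_text":
        reward += 0.5 if pred.get("text") == gold.get("text") else 0.0
    else:
        reward += 0.5                                # other types: type match is enough
    return reward

# Example: a click predicted 5 px away from the reference gets the full reward.
print(action_reward({"type": "click", "x": 105, "y": 200},
                    {"type": "click", "x": 100, "y": 198}))   # -> 1.0
```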
arXiv Detail & Related papers (2025-03-27T15:39:30Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions.
In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives [50.772462704559345]
We introduce DrICL, a novel optimization method that enhances model performance through Differentiated Learning and advantage-based Reweighting objectives.
Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels.
We develop the Many-Shot ICL Benchmark (ICL-50), a large-scale benchmark of 50 tasks covering shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for fine-tuning purposes.
arXiv Detail & Related papers (2025-01-07T14:57:08Z) - Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Iris introduces two key techniques: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL).
Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations.
These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z) - EMOv2: Pushing 5M Vision Model Frontier [92.21687467702972]
We establish a new frontier for lightweight models at the 5M-parameter magnitude across various downstream tasks.
Our work rethinks the lightweight infrastructure of the efficient inverted residual block (IRB) and practical components in the Transformer.
Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, we investigate the performance upper limit of lightweight models at the 5M magnitude.
arXiv Detail & Related papers (2024-12-09T17:12:22Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data [54.934578742209716]
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets.
LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student.
Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
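The sketch below illustrates one plausible way to combine teacher and student signals for sample selection in the spirit of LLKD; the confidence-times-entropy score and the top-fraction threshold are assumptions, not the paper's exact selection rule.

```python
# Hedged sketch of teacher/student-signal sample selection for distillation.
import torch

def select_for_distillation(teacher_probs, student_probs, top_fraction=0.5):
    """Pick unlabeled samples whose teacher pseudo-labels look reliable (high teacher
    confidence) and that the student still finds hard (high student entropy).

    teacher_probs, student_probs: (N, num_classes) softmax outputs.
    Returns indices of the selected samples.
    """
    teacher_conf = teacher_probs.max(dim=-1).values                      # pseudo-label confidence
    student_entropy = -(student_probs * student_probs.clamp_min(1e-12).log()).sum(-1)
    score = teacher_conf * student_entropy                               # confident teacher, uncertain student
    k = max(1, int(top_fraction * score.numel()))
    return score.topk(k).indices

# Example usage with random distributions over 5 intent classes.
t = torch.softmax(torch.randn(8, 5), dim=-1)
s = torch.softmax(torch.randn(8, 5), dim=-1)
print(select_for_distillation(t, s, top_fraction=0.25))
```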
arXiv Detail & Related papers (2024-11-12T18:57:59Z) - Intent Detection in the Age of LLMs [3.755082744150185]
Intent detection is a critical component of task-oriented dialogue systems (TODS).
Traditional approaches relied on computationally efficient supervised sentence transformer encoder models.
The emergence of generative large language models (LLMs) with intrinsic world knowledge presents new opportunities to address these challenges.
arXiv Detail & Related papers (2024-10-02T15:01:55Z) - VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs [29.80918775422563]
We present a novel dataset, termed VISION2UI, extracted from real-world scenarios, augmented with comprehensive layout information.
This dataset is derived through a series of operations, encompassing the collection, cleaning, and filtering of the open-source Common Crawl dataset.
Ultimately, this process yields a dataset comprising 2,000 parallel samples encompassing design visions and UI code.
arXiv Detail & Related papers (2024-04-09T15:05:48Z) - End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames [55.72994484532856]
Temporal action detection (TAD) has seen significant performance improvements with end-to-end training.
Due to the memory bottleneck, only models with limited scales and limited data volumes can afford end-to-end training.
We reduce the memory consumption for end-to-end training, and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames.
arXiv Detail & Related papers (2023-11-28T21:31:04Z) - Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning.
We apply 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs.
Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute over the previous SoTA) to a new best of 80% single-run success rate on the R2R test split through simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z) - AdaMTL: Adaptive Input-dependent Inference for Efficient Multi-Task Learning [1.4963011898406864]
We introduce AdaMTL, an adaptive framework that learns task-aware inference policies for multi-task learning models.
AdaMTL reduces the computational complexity by 43% while improving the accuracy by 1.32% compared to single-task models.
When deployed on Vuzix M4000 smart glasses, AdaMTL reduces the inference latency and the energy consumption by up to 21.8% and 37.5%, respectively.
arXiv Detail & Related papers (2023-04-17T20:17:44Z) - Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model.
Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents.
Our method is thus able to leverage the expressive power of large language models to produce diverse training data.
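For context, pointwise V-information compares a model's log-probability of the label given the utterance against the log-probability given a null input. The short sketch below shows how such a score could gate synthetic datapoints; the function interfaces and the zero threshold are illustrative assumptions, not the paper's exact filtering procedure.

```python
# Hedged sketch of PVI-based filtering for synthesized (utterance, intent) pairs.
import math

def pvi(logprob_label_given_input, logprob_label_given_null):
    """PVI(x -> y) = -log p(y | null) + log p(y | x), converted to bits."""
    return (logprob_label_given_input - logprob_label_given_null) / math.log(2)

def keep_synthetic_example(logp_with_x, logp_without_x, threshold=0.0):
    """Keep a generated pair only if the utterance is informative about its intent
    label, i.e. its PVI exceeds a threshold."""
    return pvi(logp_with_x, logp_without_x) > threshold

# Example with made-up natural-log probabilities: the utterance makes the label
# much more likely than the empty input does, so the pair is kept.
print(keep_synthetic_example(math.log(0.9), math.log(0.2)))   # True
```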
arXiv Detail & Related papers (2023-02-10T07:37:49Z) - A lightweight and accurate YOLO-like network for small target detection
in Aerial Imagery [94.78943497436492]
We present YOLO-S, a simple, fast and efficient network for small target detection.
YOLO-S exploits a small feature extractor based on Darknet20, as well as skip connection, via both bypass and concatenation.
YOLO-S has an 87% smaller parameter size and almost half the FLOPs of YOLOv3, making deployment practical for low-power industrial applications.
arXiv Detail & Related papers (2022-04-05T16:29:49Z) - LiteMuL: A Lightweight On-Device Sequence Tagger using Multi-task Learning [1.3192560874022086]
LiteMuL is a lightweight on-device sequence tagger that can efficiently process the user conversations using a Multi-Task Learning approach.
Our model is competitive with other MTL approaches for NER and POS tasks while outshining them with a lower memory footprint.
arXiv Detail & Related papers (2020-12-15T19:15:54Z)