Identifying User Goals from UI Trajectories
- URL: http://arxiv.org/abs/2406.14314v3
- Date: Mon, 03 Mar 2025 15:47:10 GMT
- Title: Identifying User Goals from UI Trajectories
- Authors: Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan
- Abstract summary: We propose a new task: goal identification from observed UI trajectories. We also introduce a novel evaluation methodology designed to assess whether two intent descriptions can be considered paraphrases. To benchmark this task, we compare the performance of humans and state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Identifying underlying user goals and intents has been recognized as valuable in various personalization-oriented settings, such as personalized agents, improved search responses, advertising, user analytics, and more. In this paper, we propose a new task, goal identification from observed UI trajectories, which aims to infer the user's detailed intentions when performing a task within UI environments. To support this task, we also introduce a novel evaluation methodology designed to assess whether two intent descriptions can be considered paraphrases within a specific UI environment. Furthermore, we demonstrate how this task can leverage datasets designed for the inverse problem of UI automation, utilizing Android and web datasets for our experiments. To benchmark this task, we compare the performance of humans and state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro, using our proposed metric. The results reveal that both Gemini and GPT underperform relative to humans, underscoring the challenge of the proposed task and the significant room for improvement. This work highlights the importance of goal identification within UI trajectories, providing a foundation for further exploration and advancement in this area.
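To make the task setup concrete, the sketch below shows how goal identification and the paraphrase-based evaluation could be framed as prompts to a large language model. The `UIStep` structure, the prompt wording, and the `call_llm` helper are illustrative assumptions for exposition, not the implementation used in the paper.

```python
# Minimal sketch of goal identification from a UI trajectory, plus a
# paraphrase judgment between two intent descriptions. All names and prompt
# wording are hypothetical; call_llm stands in for GPT-4, Gemini-1.5 Pro, etc.
from dataclasses import dataclass
from typing import List


@dataclass
class UIStep:
    screen_description: str  # e.g. an accessibility-tree or screenshot caption
    action: str              # e.g. 'tap "Add to cart"', 'type "wireless mouse"'


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a state-of-the-art model."""
    raise NotImplementedError


def identify_goal(trajectory: List[UIStep]) -> str:
    """Ask the model to infer the user's detailed intent from an observed trajectory."""
    steps = "\n".join(
        f"{i + 1}. Screen: {step.screen_description} | Action: {step.action}"
        for i, step in enumerate(trajectory)
    )
    prompt = (
        "Below is a sequence of UI observations and user actions.\n"
        f"{steps}\n"
        "Describe, in one sentence, the detailed goal the user was trying to achieve."
    )
    return call_llm(prompt)


def is_paraphrase(predicted_intent: str, reference_intent: str, environment: str) -> bool:
    """Judge whether two intent descriptions are paraphrases within a given UI environment."""
    prompt = (
        f"UI environment: {environment}\n"
        f"Intent A: {predicted_intent}\n"
        f"Intent B: {reference_intent}\n"
        "Within this environment, would carrying out these two intents amount to "
        "the same task? Answer yes or no."
    )
    return call_llm(prompt).strip().lower().startswith("yes")
```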
Related papers
- Leveraging Multimodal LLM for Inspirational User Interface Search [12.470067381902972]
Existing AI-based UI search methods often miss crucial semantics like target users or the mood of apps.
We used a multimodal large language model (MLLM) to extract and interpret semantics from mobile UI images.
Our approach significantly outperforms existing UI retrieval methods, offering UI designers a more enriched and contextually relevant search experience.
arXiv Detail & Related papers (2025-01-29T17:38:39Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions.
In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively)
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop ShowUI, a vision-language-action model for the digital world, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- Improving User Experience in Preference-Based Optimization of Reward Functions for Assistive Robots [5.523009758632668]
We show that CMA-ES-IG prioritizes the user's experience of the preference learning process.
We show that users find our algorithm more intuitive than previous approaches across both physical and social robot tasks.
arXiv Detail & Related papers (2024-11-17T21:52:58Z)
- TinyClick: Single-Turn Agent for Empowering GUI Automation [0.18846515534317265]
We present a single-turn agent for graphical user interface (GUI) interaction tasks, using Vision-Language Model Florence-2-Base.
The agent's primary task is identifying the screen coordinates of the UI element corresponding to the user's command.
It demonstrates strong performance on Screenspot and OmniAct, while maintaining a compact size of 0.27B parameters and minimal latency.
arXiv Detail & Related papers (2024-10-09T12:06:43Z)
- Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state of the art on the Argoverse 2 Sensor and Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- Location-Aware Visual Question Generation with Lightweight Models [21.278164764804536]
This work introduces a novel task, location-aware visual question generation (LocaVQG).
We represent such location-aware information with surrounding images and a GPS coordinate.
We learn a lightweight model that can address the LocaVQG task and fit on an edge device, such as a mobile phone.
arXiv Detail & Related papers (2023-10-23T17:33:31Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Rules Of Engagement: Levelling Up To Combat Unethical CUI Design [23.01296770233131]
We propose a simplified methodology to assess interfaces based on five dimensions taken from prior research on so-called dark patterns.
Our approach offers a numeric score to its users representing the manipulative nature of evaluated interfaces.
arXiv Detail & Related papers (2022-07-19T14:02:24Z)
- First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization [112.40598205054994]
We formalize this idea as a completely unsupervised objective for optimizing interfaces.
We conduct an observational study on 540K examples of users operating various keyboard and eye gaze interfaces for typing, controlling simulated robots, and playing video games.
The results show that our mutual information scores are predictive of the ground-truth task completion metrics in a variety of domains.
arXiv Detail & Related papers (2022-05-24T21:57:18Z)
- X2T: Training an X-to-Text Typing Interface with Online Learning from User Feedback [83.95599156217945]
We focus on assistive typing applications in which a user cannot operate a keyboard, but can supply other inputs.
Standard methods train a model on a fixed dataset of user inputs, then deploy a static interface that does not learn from its mistakes.
We investigate a simple idea that would enable such interfaces to improve over time, with minimal additional effort from the user.
arXiv Detail & Related papers (2022-03-04T00:07:20Z)
- GANSlider: How Users Control Generative Models for Images using Multiple Sliders with and without Feedforward Information [33.28541180149195]
We investigate how multiple sliders with and without feedforward visualizations influence users' control of generative models.
We found that more control dimensions (sliders) significantly increase task difficulty and user actions.
Visualizations alone are not always sufficient for users to understand individual control dimensions.
arXiv Detail & Related papers (2022-02-02T11:25:07Z)
- RPT++: Customized Feature Representation for Siamese Visual Tracking [16.305972000224358]
We argue that the performance gain of visual tracking is limited since features extracted from the salient area provide more recognizable visual patterns for classification.
We propose two customized feature extractors, named polar pooling and extreme pooling to capture task-specific visual patterns.
We demonstrate the effectiveness of the task-specific feature representation by integrating it into the recent and advanced tracker RPT.
arXiv Detail & Related papers (2021-10-23T10:58:57Z)
- Assisted Perception: Optimizing Observations to Communicate State [112.40598205054994]
We aim to help users estimate the state of the world in tasks like robotic teleoperation and navigation with visual impairments.
We synthesize new observations that lead to more accurate internal state estimates when processed by the user.
arXiv Detail & Related papers (2020-08-06T19:08:05Z)
- ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects [119.46959413000594]
This document summarizes the consensus recommendations of a working group on ObjectNav.
We make recommendations on subtle but important details of evaluation criteria.
We provide a detailed description of the instantiation of these recommendations in challenges organized at the Embodied AI workshop at CVPR 2020.
arXiv Detail & Related papers (2020-06-23T17:18:54Z)
- Mining Implicit Entity Preference from User-Item Interaction Data for Knowledge Graph Completion via Adversarial Learning [82.46332224556257]
We propose a novel adversarial learning approach by leveraging user interaction data for the Knowledge Graph Completion task.
Our generator is isolated from user interaction data, and serves to improve the performance of the discriminator.
To discover the implicit entity preferences of users, we design an elaborate collaborative learning algorithm based on graph neural networks.
arXiv Detail & Related papers (2020-03-28T05:47:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.