Grounded GUI Understanding for Vision Based Spatial Intelligent Agent: Exemplified by Virtual Reality Apps
- URL: http://arxiv.org/abs/2409.10811v3
- Date: Sat, 26 Oct 2024 05:38:02 GMT
- Title: Grounded GUI Understanding for Vision Based Spatial Intelligent Agent: Exemplified by Virtual Reality Apps
- Authors: Shuqing Li, Binchang Li, Yepang Liu, Cuiyun Gao, Jianping Zhang, Shing-Chi Cheung, Michael R. Lyu
- Abstract summary: We propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter.
By imitating human behaviors, Orienter observes and understands the semantic contexts of VR app scenes first, before performing the detection.
- Score: 41.601579396549404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, spatial computing Virtual Reality (VR) has emerged as a transformative technology, offering users immersive and interactive experiences across diversified virtual environments. Users can interact with VR apps through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). The accurate recognition of these IGEs is instrumental, serving as the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model based on a large-scale manually-labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories like buttons and spinners. Such approaches can hardly be applied to IGE detection in VR apps, due to a multitude of challenges including complexities posed by open-vocabulary and heterogeneous IGE categories, intricacies of context-sensitive interactability, and the necessities of precise spatial perception and visual-semantic alignment for accurate IGE detection results. Thus, it is necessary to embark on the IGE research tailored to VR apps. In this paper, we propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter. By imitating human behaviors, Orienter observes and understands the semantic contexts of VR app scenes first, before performing the detection. The detection process is iterated within a feedback-directed validation and reflection loop. Specifically, Orienter contains three components, including (1) Semantic context comprehension, (2) Reflection-directed IGE candidate detection, and (3) Context-sensitive interactability classification. Extensive experiments demonstrate that Orienter is more effective than the state-of-the-art GUI element detection approaches.
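The abstract describes Orienter as a three-component pipeline wrapped in a feedback-directed validation and reflection loop. The following Python sketch only illustrates that control flow as summarized above; the component functions (comprehend_context, detect_candidates, classify_interactability, validate) and the Candidate type are hypothetical placeholders, not the authors' implementation or API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """A hypothetical IGE candidate: a scene region plus a label."""
    region: Tuple[float, float, float, float]
    label: str
    interactable: bool = False


def comprehend_context(scene_image) -> str:
    """(1) Semantic context comprehension: summarize what the VR scene shows.
    Placeholder for the multimodal-model call used in the real framework."""
    raise NotImplementedError


def detect_candidates(scene_image, context: str,
                      feedback: Optional[str]) -> List[Candidate]:
    """(2) Reflection-directed IGE candidate detection, optionally conditioned
    on feedback produced by the previous iteration."""
    raise NotImplementedError


def classify_interactability(scene_image, context: str,
                             candidates: List[Candidate]) -> List[Candidate]:
    """(3) Context-sensitive interactability classification."""
    raise NotImplementedError


def validate(candidates: List[Candidate]) -> Tuple[bool, Optional[str]]:
    """Feedback-directed validation: accept the result or return reflection feedback."""
    raise NotImplementedError


def orienter_like_detection(scene_image, max_iterations: int = 3) -> List[Candidate]:
    """Iterate detection inside a validation-and-reflection loop, as the abstract describes."""
    context = comprehend_context(scene_image)        # observe the scene before detecting
    feedback: Optional[str] = None
    results: List[Candidate] = []
    for _ in range(max_iterations):
        candidates = detect_candidates(scene_image, context, feedback)
        results = classify_interactability(scene_image, context, candidates)
        accepted, feedback = validate(results)       # reflect on the current detection
        if accepted:
            break                                    # stop once validation passes
    return results
```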
Related papers
- Large Language Model-assisted Speech and Pointing Benefits Multiple 3D Object Selection in Virtual Reality [20.669785157017486]
We explore the possibility of leveraging large language models to assist multi-object selection tasks in virtual reality via a multimodal speech and raycast interaction technique.
Results indicate that the introduced technique, AssistVR, outperforms the baseline technique when there are multiple target objects.
arXiv Detail & Related papers (2024-10-28T14:56:51Z)
- Tremor Reduction for Accessible Ray Based Interaction in VR Applications [0.0]
Many traditional 2D interface interaction methods have been directly converted to work in a VR space with little alteration to the input mechanism.
In this paper we propose the use of a low-pass filter to normalize user input noise, alleviating fine motor requirements during ray-based interaction (a minimal smoothing sketch is given after this list).
arXiv Detail & Related papers (2024-05-12T17:07:16Z)
- Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs [5.891295920078768]
We introduce an advanced approach for fine-grained object visual key field detection.
First, we use the segment anything model (SAM) to generate detailed spatial maps of objects in images.
Next, we use Vision Studio to extract semantic object descriptions.
Third, we employ GPT-4's common sense knowledge, bridging the gap between an object's semantics and its spatial map.
arXiv Detail & Related papers (2024-04-01T14:53:36Z)
- AgentStudio: A Toolkit for Building General Virtual Agents [57.02375267926862]
General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments.
AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces.
It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos.
Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function calling with efficient auto-evaluation.
arXiv Detail & Related papers (2024-03-26T17:54:15Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of the unbalanced data distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- The Devil is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all the monocular 3D object detectors on the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z)
- Semantic Interaction in Augmented Reality Environments for Microsoft HoloLens [28.10437301492564]
We capture indoor environments and display interaction cues with known object classes using the Microsoft HoloLens.
As the user moves, the 3D mesh recorded by the HoloLens is annotated online with semantic classes using a projective approach.
The results are fused onto the mesh; prominent object segments are identified and displayed in 3D to the user.
arXiv Detail & Related papers (2021-11-18T14:58:04Z)
- HIDA: Towards Holistic Indoor Understanding for the Visually Impaired via Semantic Instance Segmentation with a Wearable Solid-State LiDAR Sensor [25.206941504935685]
HIDA is a lightweight assistive system based on 3D point cloud instance segmentation with a solid-state LiDAR sensor.
Our entire system consists of three hardware components, two interactive functions (obstacle avoidance and object finding), and a voice user interface.
The proposed 3D instance segmentation model achieves state-of-the-art performance on the ScanNet v2 dataset.
arXiv Detail & Related papers (2021-07-07T12:23:53Z)
- Improving Point Cloud Semantic Segmentation by Learning 3D Object Detection [102.62963605429508]
Point cloud semantic segmentation plays an essential role in autonomous driving.
Current 3D semantic segmentation networks focus on convolutional architectures that perform well on well-represented classes.
We propose a novel Detection Aware 3D Semantic Segmentation (DASS) framework that explicitly leverages localization features from an auxiliary 3D object detection task.
arXiv Detail & Related papers (2020-09-22T14:17:40Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
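The Tremor Reduction entry above proposes low-pass filtering of user input to suppress hand-tremor noise during ray-based interaction. The sketch below is purely illustrative and not that paper's implementation: it applies a one-pole exponential moving average to the ray direction, and the cutoff frequency is an assumed parameter.

```python
import math


class LowPassRayFilter:
    """One-pole (exponential moving average) low-pass filter for a 3D ray direction.

    Illustrative only: the cited paper proposes low-pass filtering of user input,
    but the filter design and parameters here are assumptions.
    """

    def __init__(self, cutoff_hz: float = 4.0):
        self.cutoff_hz = cutoff_hz  # assumed cutoff; hand tremor is typically a few Hz
        self._state = None          # last smoothed direction (x, y, z)

    def update(self, direction, dt: float):
        """Smooth a new raw ray direction sample; dt is the frame time in seconds."""
        # Convert the cutoff frequency into a per-frame smoothing factor.
        alpha = 1.0 - math.exp(-2.0 * math.pi * self.cutoff_hz * dt)
        if self._state is None:
            self._state = tuple(direction)
        else:
            self._state = tuple(
                prev + alpha * (new - prev)
                for prev, new in zip(self._state, direction)
            )
        # Re-normalize so the result remains a unit direction vector.
        norm = math.sqrt(sum(c * c for c in self._state)) or 1.0
        self._state = tuple(c / norm for c in self._state)
        return self._state


# Example: smooth a noisy controller ray sampled at 90 Hz.
if __name__ == "__main__":
    filt = LowPassRayFilter(cutoff_hz=4.0)
    noisy_samples = [(0.01, 0.0, 1.0), (-0.02, 0.01, 1.0), (0.015, -0.01, 1.0)]
    for sample in noisy_samples:
        print(filt.update(sample, dt=1.0 / 90.0))
```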