ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality
- URL: http://arxiv.org/abs/2501.12553v1
- Date: Wed, 22 Jan 2025 00:17:08 GMT
- Title: ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality
- Authors: Yanming Xiu, Tim Scargill, Maria Gorlatova
- Abstract summary: ViDDAR is a comprehensive full-reference system to monitor and evaluate virtual content in Augmented Reality environments. To the best of our knowledge, ViDDAR is the first system to employ Vision Language Models (VLMs) for detecting task-detrimental content in AR settings.
- Score: 2.1506382989223782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Augmented Reality (AR), virtual content enhances user experience by providing additional information. However, improperly positioned or designed virtual content can be detrimental to task performance, as it can impair users' ability to accurately interpret real-world information. In this paper we examine two types of task-detrimental virtual content: obstruction attacks, in which virtual content prevents users from seeing real-world objects, and information manipulation attacks, in which virtual content interferes with users' ability to accurately interpret real-world information. We provide a mathematical framework to characterize these attacks and create a custom open-source dataset for attack evaluation. To address these attacks, we introduce ViDDAR (Vision language model-based Task-Detrimental content Detector for Augmented Reality), a comprehensive full-reference system that leverages Vision Language Models (VLMs) and advanced deep learning techniques to monitor and evaluate virtual content in AR environments, employing a user-edge-cloud architecture to balance performance with low latency. To the best of our knowledge, ViDDAR is the first system to employ VLMs for detecting task-detrimental content in AR settings. Our evaluation results demonstrate that ViDDAR effectively understands complex scenes and detects task-detrimental content, achieving up to 92.15% obstruction detection accuracy with a detection latency of 533 ms, and an 82.46% information manipulation content detection accuracy with a latency of 9.62 s.
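The abstract outlines a full-reference design: the system compares what the user would see without augmentation against the augmented view to decide whether virtual content is hiding real-world objects. The sketch below is a minimal illustration of that full-reference idea only, using a simple geometric coverage test in place of ViDDAR's actual VLM-based, user-edge-cloud pipeline; the `Box` type, the 0.7 coverage threshold, and the example detections are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical input format)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

def intersection_area(a: Box, b: Box) -> float:
    """Area of overlap between two boxes."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(0.0, w) * max(0.0, h)

def find_obstructed_objects(
    real_objects: dict[str, Box],      # detections from the reference (unaugmented) view
    virtual_content: list[Box],        # regions occupied by rendered virtual content
    coverage_threshold: float = 0.7,   # fraction of an object that must be hidden (assumed value)
) -> list[str]:
    """Return labels of real-world objects mostly covered by virtual content.

    Note: summing intersections may double-count overlapping virtual boxes;
    acceptable for this sketch.
    """
    obstructed = []
    for label, obj_box in real_objects.items():
        covered = sum(intersection_area(obj_box, v) for v in virtual_content)
        if obj_box.area() > 0 and covered / obj_box.area() >= coverage_threshold:
            obstructed.append(label)
    return obstructed

# Example: a virtual banner covering most of a stop sign in the AR view.
real = {"stop sign": Box(100, 50, 200, 150), "pedestrian": Box(300, 40, 360, 200)}
virtual = [Box(90, 40, 210, 140)]
print(find_obstructed_objects(real, virtual))  # ['stop sign']
```

In practice the paper's system reasons about scene semantics with VLMs rather than raw box overlap, but the same reference-versus-augmented comparison underlies the obstruction check.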
Related papers
- Toward Safe, Trustworthy and Realistic Augmented Reality User Experience [0.0]
Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules.
arXiv Detail & Related papers (2025-07-31T03:42:52Z) - Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach [2.4171019220503402]
We focus on visual information manipulation (VIM) attacks in augmented reality (AR), where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information (a minimal data-structure sketch of this taxonomy appears after the paper list below). Based on the taxonomy, we construct a dataset, AR-VIM, which consists of 452 raw-AR video pairs spanning 202 different scenes. To detect the attacks in the dataset, we propose a multimodal semantic reasoning framework, VIM-Sense.
arXiv Detail & Related papers (2025-07-27T17:04:50Z) - Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties [19.430061128447022]
We present the first systematic investigation into the vulnerabilities of mobile GUI agents. We introduce a scalable attack simulation framework AgentHazard, which enables flexible and targeted modifications of screen content. We develop a benchmark suite comprising both a dynamic task execution environment and a static dataset of attack scenarios. Our findings reveal that all examined agents are significantly influenced by misleading third-party content.
arXiv Detail & Related papers (2025-07-06T03:31:36Z) - Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments [61.808686396077036]
We present GHOST, the first clean-label backdoor attack specifically designed for mobile agents built upon vision-language models (VLMs). Our method manipulates only the visual inputs of a portion of the training samples without altering their corresponding labels or instructions. We evaluate our method across six real-world Android apps and three VLM architectures adapted for mobile use.
arXiv Detail & Related papers (2025-06-16T08:09:32Z) - Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble [3.481985817302898]
We evaluate the capabilities of three state-of-the-art commercial Vision-Language Models (VLMs) in identifying and describing AR scenes.
Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes.
We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility.
arXiv Detail & Related papers (2025-01-21T23:07:03Z) - Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. We introduce Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. We also present USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects.
arXiv Detail & Related papers (2024-12-02T11:33:55Z) - "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by retrieving content from external knowledge bases.
In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z) - Augmented Commonsense Knowledge for Remote Object Grounding [67.30864498454805]
We propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as a temporal knowledge graph for improving agent navigation.
ACK consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment.
We add a new pipeline for the commonsense-based decision-making process which leads to more accurate local action prediction.
arXiv Detail & Related papers (2024-06-03T12:12:33Z) - Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality [37.564863636844905]
We present a novel system, called OmniVR, designed to enhance visual clarity during VR navigation.
Our system enables users to effortlessly locate and zoom in on the objects of interest in VR.
arXiv Detail & Related papers (2024-05-01T07:08:24Z) - Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs [38.02017186215372]
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks.
However, existing V-LLMs demonstrate weak spatial reasoning and localization awareness.
We explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs.
arXiv Detail & Related papers (2024-04-11T03:09:34Z) - Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location by following natural language instructions in real scenes.
Most previous approaches represent navigable candidates with either features of the entire view or object-centric features.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z) - Real or Virtual: A Video Conferencing Background Manipulation-Detection System [25.94894351460089]
We present a detection strategy to distinguish between real and virtual video conferencing user backgrounds.
We demonstrate the robustness of our detector against different adversarial attacks that an adversary might mount.
Our performance results show that we can identify a real from a virtual background with an accuracy of 99.80%.
arXiv Detail & Related papers (2022-04-25T08:14:11Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge [133.80567761430584]
We collect a large-scale dataset called OVIS for video instance segmentation in the occluded scenario.
OVIS consists of 296k high-quality instance masks and 901 occluded scenes.
All baseline methods encounter a significant performance degradation of about 80% in the heavily occluded object group.
arXiv Detail & Related papers (2021-11-15T17:59:03Z) - Embodied Visual Active Learning for Semantic Segmentation [33.02424587900808]
We study the task of embodied visual active learning, where an agent is set to explore a 3D environment with the goal of acquiring visual scene understanding.
We develop a battery of agents, both learnt and pre-specified, with different levels of knowledge of the environment.
We extensively evaluate the proposed models using the Matterport3D simulator and show that a fully learnt method outperforms comparable pre-specified counterparts.
arXiv Detail & Related papers (2020-12-17T11:02:34Z)
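As forward-referenced in the AR-VIM entry above, its taxonomy (three manipulation formats crossed with three purposes) maps naturally onto a small data structure. The following is a hypothetical sketch of how labels along those two axes might be encoded for a dataset of raw/AR video pairs; the class and field names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ManipulationFormat(Enum):
    """How the virtual content alters the scene (formats from the VIM taxonomy)."""
    CHARACTER = auto()  # presumably character-level edits, e.g. a digit on a sign
    PHRASE = auto()     # presumably word- or phrase-level replacement
    PATTERN = auto()    # presumably changes to a visual pattern or symbol

class ManipulationPurpose(Enum):
    """What the manipulation tries to achieve (purposes from the VIM taxonomy)."""
    INFORMATION_REPLACEMENT = auto()
    INFORMATION_OBFUSCATION = auto()
    EXTRA_WRONG_INFORMATION = auto()

@dataclass
class VIMAttackLabel:
    """Illustrative label for one raw/AR video pair in a VIM-style dataset."""
    scene_id: str
    format: ManipulationFormat
    purpose: ManipulationPurpose

# Example: a virtual overlay that swaps a phrase on a real-world sign.
label = VIMAttackLabel("scene_042", ManipulationFormat.PHRASE,
                       ManipulationPurpose.INFORMATION_REPLACEMENT)
print(label.format.name, label.purpose.name)
```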
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.