Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach
- URL: http://arxiv.org/abs/2507.20356v3
- Date: Mon, 04 Aug 2025 03:53:56 GMT
- Title: Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach
- Authors: Yanming Xiu, Maria Gorlatova
- Abstract summary: We focus on visual information manipulation (VIM) attacks in augmented reality (AR), where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM, which consists of 452 raw-AR video pairs spanning 202 different scenes. To detect the attacks in the dataset, we propose a multimodal semantic reasoning framework, VIM-Sense.
- Score: 2.4171019220503402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The virtual content in augmented reality (AR) can introduce misleading or harmful information, leading to semantic misunderstandings or user errors. In this work, we focus on visual information manipulation (VIM) attacks in AR, where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM, which consists of 452 raw-AR video pairs spanning 202 different scenes, each simulating a real-world AR scenario. To detect the attacks in the dataset, we propose a multimodal semantic reasoning framework, VIM-Sense. It combines the language and visual understanding capabilities of vision-language models (VLMs) with optical character recognition (OCR)-based textual analysis. VIM-Sense achieves an attack detection accuracy of 88.94% on AR-VIM, consistently outperforming vision-only and text-only baselines. The system achieves an average attack detection latency of 7.07 seconds in a simulated video processing framework and 7.17 seconds in a real-world evaluation conducted on a mobile Android AR application.
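To make the idea behind VIM-Sense concrete, the sketch below pairs an OCR-based text comparison of a raw frame and its AR counterpart with a vision-language-model query. This is only a minimal illustration, not the authors' implementation: query_vlm is a hypothetical stand-in for whatever VLM API is used, the prompt wording is invented, and the similarity threshold is an arbitrary placeholder.

```python
# Minimal sketch of a VIM-style detector combining OCR text diffing with a
# VLM judgment. NOT the authors' VIM-Sense implementation; query_vlm() is a
# hypothetical placeholder for any vision-language model API.
from typing import List
import difflib

import pytesseract            # assumes the tesseract OCR engine is installed
from PIL import Image


def ocr_text(frame_path: str) -> str:
    """Extract visible text from a single video frame."""
    return pytesseract.image_to_string(Image.open(frame_path)).strip()


def text_changed(raw_frame: str, ar_frame: str, threshold: float = 0.9) -> bool:
    """Flag frame pairs whose OCR text similarity drops below a threshold."""
    ratio = difflib.SequenceMatcher(
        None, ocr_text(raw_frame), ocr_text(ar_frame)
    ).ratio()
    return ratio < threshold


def query_vlm(prompt: str, images: List[str]) -> str:
    """Hypothetical stand-in for a vision-language model call.
    Replace with a concrete VLM API; returns a fixed answer here."""
    return "no"  # placeholder so the sketch runs end to end


def detect_vim(raw_frame: str, ar_frame: str) -> bool:
    """Combine the OCR signal with a semantic VLM check on a raw/AR frame pair."""
    prompt = (
        "The first image shows a real-world scene; the second adds AR content. "
        "Does the virtual content change the meaning of the scene "
        "(e.g., replaced, hidden, or extra wrong information)? Answer yes or no."
    )
    vlm_says_attack = query_vlm(prompt, [raw_frame, ar_frame]).lower().startswith("yes")
    return text_changed(raw_frame, ar_frame) or vlm_says_attack
```

In the paper's framework the OCR branch and the VLM branch are fused to produce the final decision; the simple OR above only mimics that combination at a high level.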
Related papers
- Toward Safe, Trustworthy and Realistic Augmented Reality User Experience [0.0]
Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules.
arXiv Detail & Related papers (2025-07-31T03:42:52Z) - VidText: Towards Comprehensive Evaluation for Video Text Understanding [54.15328647518558]
VidText is a benchmark for comprehensive and in-depth evaluation of video text understanding. It covers a wide range of real-world scenarios and supports multilingual content. It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks.
arXiv Detail & Related papers (2025-05-28T19:39:35Z) - Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning [73.7808110878037]
This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++). By converting RGB images to events, the method captures motion information more accurately and mitigates background scene biases. Experiments validate the effectiveness of MDST++, demonstrating its consistent superiority over state-of-the-art methods on mainstream benchmarks.
arXiv Detail & Related papers (2025-05-26T13:06:01Z) - ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality [2.1506382989223782]
ViDDAR is a comprehensive full-reference system to monitor and evaluate virtual content in Augmented Reality environments. To the best of our knowledge, ViDDAR is the first system to employ Vision Language Models (VLMs) for detecting task-detrimental content in AR settings.
arXiv Detail & Related papers (2025-01-22T00:17:08Z) - Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble [3.481985817302898]
We evaluate the capabilities of three state-of-the-art commercial Vision-Language Models (VLMs) in identifying and describing AR scenes. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility.
arXiv Detail & Related papers (2025-01-21T23:07:03Z) - Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. We introduce Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. We also present USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects.
arXiv Detail & Related papers (2024-12-02T11:33:55Z) - Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to roughly 7% in recall@K=1.
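For reference, recall@K=1 as cited above is simply the fraction of queries whose top-ranked gallery item is the correct target. The toy computation below uses placeholder numbers unrelated to the CoVR paper's data and only illustrates how the metric is evaluated.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, targets: np.ndarray, k: int = 1) -> float:
    """similarity[i, j]: score of gallery item j for query i;
    targets[i]: index of the correct gallery item for query i."""
    topk = np.argsort(-similarity, axis=1)[:, :k]        # indices of the k best matches
    hits = (topk == targets[:, None]).any(axis=1)        # correct item among the top k?
    return float(hits.mean())

# Toy example: 3 queries, 4 gallery videos.
sim = np.array([[0.9, 0.1, 0.3, 0.2],
                [0.2, 0.4, 0.8, 0.1],
                [0.5, 0.6, 0.1, 0.3]])
print(recall_at_k(sim, np.array([0, 2, 0]), k=1))        # 2 of 3 queries correct -> ~0.667
```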
arXiv Detail & Related papers (2024-03-25T17:59:03Z) - Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense [98.70218717851665]
It is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z) - ViA: View-invariant Skeleton Action Representation Learning via Motion Retargeting [10.811088895926776]
ViA is a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning.
We conduct a study focusing on transfer-learning for skeleton-based action recognition with self-supervised pre-training on real-world data.
Our results showcase that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy.
arXiv Detail & Related papers (2022-08-31T18:49:38Z) - Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis [60.13902294276283]
We present VideoSham, a dataset consisting of 826 videos (413 real and 413 manipulated).
Many of the existing deepfake datasets focus exclusively on two types of facial manipulations -- swapping with a different subject's face or altering the existing face.
Our analysis shows that state-of-the-art manipulation detection algorithms only work for a few specific attacks and do not scale well on VideoSham.
arXiv Detail & Related papers (2022-07-26T17:39:04Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
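As a rough illustration of the scale-recovery idea in the VRVO summary above, and not that paper's actual pipeline, a scale-aware disparity can be turned into metric depth via standard stereo geometry and then used to fix the scale of an up-to-scale VO translation. All numbers and parameter names below are placeholders.

```python
import numpy as np

def disparity_to_depth(disparity: np.ndarray, focal_px: float, baseline_m: float) -> np.ndarray:
    """Standard stereo relation: depth = f * B / disparity."""
    return focal_px * baseline_m / np.clip(disparity, 1e-6, None)

def rescale_translation(t_vo: np.ndarray,
                        depth_metric: np.ndarray,
                        depth_vo: np.ndarray) -> np.ndarray:
    """Align an up-to-scale VO translation with metric depth via a median ratio."""
    scale = np.median(depth_metric / np.clip(depth_vo, 1e-6, None))
    return scale * t_vo

# Toy usage with placeholder values.
disp = np.random.uniform(5.0, 50.0, size=(480, 640))       # predicted disparity map
depth_metric = disparity_to_depth(disp, focal_px=720.0, baseline_m=0.54)
depth_vo = depth_metric / 3.0                               # pretend VO depth is off by 3x
t_metric = rescale_translation(np.array([0.1, 0.0, 0.9]), depth_metric, depth_vo)
```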
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.