Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble
- URL: http://arxiv.org/abs/2501.13964v3
- Date: Sat, 01 Feb 2025 20:27:00 GMT
- Title: Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble
- Authors: Lin Duan, Yanming Xiu, Maria Gorlatova
- Abstract summary: We evaluate the capabilities of three state-of-the-art commercial Vision-Language Models (VLMs) in identifying and describing AR scenes. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility.
- Score: 3.481985817302898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle when faced with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
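As a rough illustration of the evaluation the abstract describes, the minimal sketch below computes perception and description True Positive Rates from per-scene labels. The `SceneResult` schema and the example labels are illustrative assumptions, not the paper's actual DiverseAR annotation format or VLM-querying pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneResult:
    """Ground truth and VLM outputs for one AR scene (hypothetical schema)."""
    has_virtual_content: bool      # ground-truth label for the scene
    vlm_detected_virtual: bool     # did the VLM perceive any virtual content?
    vlm_described_correctly: bool  # did its description match the virtual object?

def true_positive_rates(results: List[SceneResult]) -> Tuple[float, float]:
    """Perception and description TPR over scenes that contain virtual content."""
    positives = [r for r in results if r.has_virtual_content]
    if not positives:
        return 0.0, 0.0
    perception_tpr = sum(r.vlm_detected_virtual for r in positives) / len(positives)
    description_tpr = sum(r.vlm_described_correctly for r in positives) / len(positives)
    return perception_tpr, description_tpr

# Example: three AR scenes, all containing virtual content
results = [
    SceneResult(True, True, True),    # obvious virtual object (e.g., a glowing apple)
    SceneResult(True, True, False),   # perceived but misdescribed
    SceneResult(True, False, False),  # seamlessly integrated content missed entirely
]
print(true_positive_rates(results))   # -> (0.666..., 0.333...)
```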
Related papers
- Toward Safe, Trustworthy and Realistic Augmented Reality User Experience [0.0]
Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules.
arXiv Detail & Related papers (2025-07-31T03:42:52Z) - Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach [2.4171019220503402]
We focus on visual information manipulation (VIM) attacks in augmented reality (AR), where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM, which consists of 452 raw-AR video pairs spanning 202 different scenes. To detect the attacks in the dataset, we propose a multimodal semantic reasoning framework, VIM-Sense.
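A minimal sketch of how the 3x3 taxonomy above could be encoded for dataset labeling; the class and field names are illustrative assumptions, not the authors' actual AR-VIM schema.

```python
from dataclasses import dataclass
from enum import Enum

class ManipulationFormat(Enum):
    CHARACTER = "character"   # altering individual characters in scene text
    PHRASE = "phrase"         # replacing or inserting a phrase
    PATTERN = "pattern"       # changing a visual pattern or symbol

class ManipulationPurpose(Enum):
    INFORMATION_REPLACEMENT = "information_replacement"
    INFORMATION_OBFUSCATION = "information_obfuscation"
    EXTRA_WRONG_INFORMATION = "extra_wrong_information"

@dataclass
class RawARVideoPair:
    """One raw/AR video pair, labeled with the taxonomy (field names are hypothetical)."""
    scene_id: str
    raw_video_path: str
    ar_video_path: str
    format: ManipulationFormat
    purpose: ManipulationPurpose

sample = RawARVideoPair("scene_001", "raw/001.mp4", "ar/001.mp4",
                        ManipulationFormat.PHRASE,
                        ManipulationPurpose.INFORMATION_REPLACEMENT)
print(sample.format.value, sample.purpose.value)
```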
arXiv Detail & Related papers (2025-07-27T17:04:50Z) - Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which design components may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z) - IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering [7.247417417159471]
Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition.
arXiv Detail & Related papers (2025-06-29T17:02:57Z) - SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks [53.611256895338585]
We introduce SIRI-Bench, a benchmark designed to evaluate Vision-Language Models' spatial intelligence through video-based reasoning tasks. SIRI-Bench comprises nearly 1K video-question-answer triplets, where each problem is embedded in a realistic 3D scene and captured by video. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine.
arXiv Detail & Related papers (2025-06-17T13:40:00Z) - Hidden in plain sight: VLMs overlook their visual representations [48.83628674170634]
We compare vision language models (VLMs) to their visual encoders to understand their ability to integrate across these modalities. We find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance.
arXiv Detail & Related papers (2025-06-09T17:59:54Z) - Vision language models are unreliable at trivial spatial cognition [0.2902243522110345]
Vision language models (VLMs) are designed to extract relevant visuospatial information from images.
We develop a benchmark dataset -- TableTest -- whose images depict 3D scenes of objects arranged on a table, and used it to evaluate state-of-the-art VLMs.
Results show that performance can be degraded by minor variations in prompts that use equivalent descriptions.
arXiv Detail & Related papers (2025-04-22T17:38:01Z) - Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images.
Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives.
We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z) - Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision language model (VLM) framework to enhance feature extraction, scalability, and efficiency.
We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise.
Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
arXiv Detail & Related papers (2025-03-08T01:22:10Z) - ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality [2.1506382989223782]
ViDDAR is a comprehensive full-reference system to monitor and evaluate virtual content in Augmented Reality environments. To the best of our knowledge, ViDDAR is the first system to employ Vision Language Models (VLMs) for detecting task-detrimental content in AR settings.
arXiv Detail & Related papers (2025-01-22T00:17:08Z) - ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos [71.62145804686062]
We introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos and their mean opinion scores (MOSs). We propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict the perceptual quality. Experimental results demonstrate that ESVQAnet outperforms 16 state-of-the-art VQA models on the embodied perceptual quality assessment task.
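The ESVQAnet architecture is not described here, so the following is only a toy sketch of the general idea of fusing binocular spatial, motion, and semantic features into a single quality prediction; the feature dimensions, the linear head, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_predict(spatial_feat, motion_feat, semantic_feat, w, b):
    """Toy fusion: concatenate the three feature vectors and apply a linear head to predict a MOS."""
    fused = np.concatenate([spatial_feat, motion_feat, semantic_feat])
    return float(fused @ w + b)

# Hypothetical per-video features (dimensions chosen arbitrarily for illustration)
spatial = rng.normal(size=128)   # binocular spatial features
motion = rng.normal(size=64)     # motion features
semantic = rng.normal(size=256)  # semantic features

w = rng.normal(size=128 + 64 + 256) * 0.01   # untrained toy weights
mos_prediction = fuse_and_predict(spatial, motion, semantic, w, b=3.0)
print(f"predicted MOS: {mos_prediction:.2f}")
```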
arXiv Detail & Related papers (2024-12-29T10:13:30Z) - Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. We introduce Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. We also present USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects.
arXiv Detail & Related papers (2024-12-02T11:33:55Z) - VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception.
Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation.
We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment.
The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
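As a loose illustration of interleaving visual and motion tokens into one sequence: the actual VQA2 tokenization is not specified here, so the token names and the simple pairwise interleaving below are assumptions.

```python
from itertools import chain, zip_longest

def interleave_tokens(visual_tokens, motion_tokens):
    """Interleave per-frame visual and motion tokens into a single sequence (illustrative only)."""
    pairs = zip_longest(visual_tokens, motion_tokens)
    return [t for t in chain.from_iterable(pairs) if t is not None]

visual = ["<v0>", "<v1>", "<v2>", "<v3>"]   # hypothetical visual tokens
motion = ["<m0>", "<m1>"]                   # hypothetical motion tokens
print(interleave_tokens(visual, motion))
# -> ['<v0>', '<m0>', '<v1>', '<m1>', '<v2>', '<v3>']
```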
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five user-demanded evaluation capabilities, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z) - RSGPT: A Remote Sensing Vision Language Model and Benchmark [7.279747655485913]
We build a high-quality Remote Sensing Image Captioning dataset (RSICap).
This dataset comprises 2,585 human-annotated captions with rich and high-quality information.
We also provide a benchmark evaluation dataset called RSIEval.
arXiv Detail & Related papers (2023-07-28T02:23:35Z) - ArK: Augmented Reality with Knowledge Interactive Emergent Ability [115.72679420999535]
We develop an infinite agent that learns to transfer knowledge memory from general foundation models to novel domains.
The heart of our approach is an emerging mechanism, dubbed Augmented Reality with Knowledge Inference Interaction (ArK).
We show that our ArK approach, combined with large foundation models, significantly improves the quality of generated 2D/3D scenes.
arXiv Detail & Related papers (2023-05-01T17:57:01Z) - Retargetable AR: Context-aware Augmented Reality in Indoor Scenes based on 3D Scene Graph [0.22940141855172028]
Retargetable AR is a novel AR framework that yields an AR experience that is aware of scene contexts set in various real environments.
Our framework constructs a 3D scene graph characterizing the context of a real environment for AR.
The correspondence between the constructed graph and an AR scene graph denoting the context of AR content provides a semantically registered content arrangement.
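A toy sketch of the semantic registration idea described above: AR content nodes are matched to real-scene-graph nodes that carry the required semantic label. A real system would also reason over geometry and relations between nodes; all names and the dict-based graph representation are illustrative assumptions.

```python
def register_content(real_scene_graph, ar_content_graph):
    """Match each AR content node to a real-scene node with the same semantic label.

    Both graphs are given as {node_id: semantic_label} dicts (illustrative only).
    """
    arrangement = {}
    for content_id, needed_label in ar_content_graph.items():
        anchors = [n for n, label in real_scene_graph.items() if label == needed_label]
        arrangement[content_id] = anchors[0] if anchors else None  # None: no suitable anchor found
    return arrangement

real_scene = {"n1": "table", "n2": "chair", "n3": "wall"}
ar_content = {"virtual_lamp": "table", "virtual_poster": "wall"}
print(register_content(real_scene, ar_content))
# -> {'virtual_lamp': 'n1', 'virtual_poster': 'n3'}
```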
arXiv Detail & Related papers (2020-08-18T09:25:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.