Related papers: WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

URL: http://arxiv.org/abs/2511.22154v2
Date: Tue, 02 Dec 2025 08:14:37 GMT
Title: WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios
Authors: Eun Chang, Zhuangqun Huang, Yiwei Liao, Sagar Ravi Bhavsar, Amogh Param, Tammy Stark, Adel Ahmadyan, Xiao Yang, Jiaqi Wang, Ahsan Abdullah, Giang Nguyen, Akil Iyer, David Hall, Elissa Li, Shane Moon, Nicolas Scheffer, Kirmani Ahmed, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Xin Luna Dong
Abstract summary: We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering capabilities of multimodal AI assistants on wearable devices like smart glasses. WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains.
Score: 19.156760664417718
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multimodal AI assistants on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multimodal LLMs achieved QA accuracy of only 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multimodal wearable AI systems.

Related papers

Surveillance Facial Image Quality Assessment: A Multi-dimensional Dataset and Lightweight Model [59.39390911456143]
We propose the first comprehensive study on surveillance facial image quality assessment (SFIQA). SFIQA-Bench consists of 5,004 surveillance facial images captured by three widely deployed surveillance cameras in real-world scenarios. A subjective experiment is conducted to collect six dimensional quality ratings, including noise, sharpness, colorfulness, contrast, fidelity and overall quality.
arXiv Detail & Related papers (2026-02-07T06:51:03Z)
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes [36.370533774426555]
We present VisualOverload, a visual question answering (VQA) benchmark comprising 2,720 question-answer pairs. Unlike prior VQA datasets that typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated scenes. We observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions.
arXiv Detail & Related papers (2025-09-29T18:00:25Z)
MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment [14.705190484805962]
Video quality assessment (VQA) is becoming increasingly crucial for selecting high-quality videos from large-scale datasets used in pre-training. We introduce MVQA-68K, a novel multi-dimensional VQA dataset comprising over 68,000 carefully annotated videos. Experiments demonstrate that MVQA-68K significantly enhances the performance of various multimodal large language models (MLLMs) on the VQA task.
arXiv Detail & Related papers (2025-09-15T05:16:54Z)
VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. We introduce the VQA2 Instruction dataset, the first visual question answering instruction dataset that focuses on video quality assessment. The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z)
ESIQA: Perceptual Quality Assessment of Vision-Pro-based Egocentric Spatial Images [70.68629648595677]
Egocentric images and videos are emerging as a compelling form of stereoscopic XR content. The corresponding image quality assessment (IQA) research for egocentric spatial images is still lacking. In this paper, we establish the Egocentric Spatial Images Quality Assessment Database (ESQAD), the first IQA database dedicated to egocentric spatial images.
arXiv Detail & Related papers (2024-07-31T06:20:21Z)
Visual Robustness Benchmark for Visual Question Answering (VQA) [0.08246494848934446]
We propose the first large-scale benchmark comprising 213,000 augmented images. We challenge the visual robustness of multiple VQA models and assess the strength of realistic visual corruptions.
arXiv Detail & Related papers (2024-07-03T08:35:03Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenesQA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. We leverage existing 3D detection annotations to generate scene graphs and design question templates manually. We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective [93.56647950778357]
Blind image quality assessment (BIQA) predicts the human perception of image quality without any reference information. We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks.
arXiv Detail & Related papers (2023-03-27T07:58:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.