Fugu-MT Paper Translation (Abstract): WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

Paper overview: WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

arxiv url: http://arxiv.org/abs/2511.22154v2
Date: Tue, 02 Dec 2025 08:14:37 GMT
Status: Translation complete
Last updated in system: 2025-12-03 14:50:32.053385
Title: WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios
Title (reference translation): WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world Scenarios
Authors: Eun Chang, Zhuangqun Huang, Yiwei Liao, Sagar Ravi Bhavsar, Amogh Param, Tammy Stark, Adel Ahmadyan, Xiao Yang, Jiaqi Wang, Ahsan Abdullah, Giang Nguyen, Akil Iyer, David Hall, Elissa Li, Shane Moon, Nicolas Scheffer, Kirmani Ahmed, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Xin Luna Dong
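Abstract summary: We introduce WearVQA, the first benchmark designed to evaluate the visual question answering capabilities of multimodal AI assistants on wearable devices such as smart glasses. WearVQA reflects the unique challenges of egocentric interaction. The benchmark comprises 2,520 carefully curated image-question-answer triplets spanning 7 distinct image domains.
Reference score (independently computed attention measure): 19.156760664417718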
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multimodal AI assistants on wearable devices such as smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains that include both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multimodal LLMs achieved QA accuracy of only 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multimodal wearables AI systems.
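Abstract (reference translation): We introduce WearVQA, the first benchmark designed to evaluate the Visual Question Answering (VQA) capabilities of multimodal AI assistants on wearable devices such as smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction. The benchmark consists of 2,520 carefully curated image-question-answer triplets covering 7 diverse image domains that include both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multimodal LLMs achieved QA accuracy of 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multimodal wearables AI systems.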

Related papers