Fugu-MT Paper Translation (Abstract): SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

Paper overview: SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

arxiv url: http://arxiv.org/abs/2602.22683v1
Date: Thu, 26 Feb 2026 06:55:48 GMT
Status: Translation complete
Last updated in system: 2026-02-27 18:41:22.565555
Title: SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
Title (reference translation): SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
Authors: Zhuohang Jiang, Xu Yuan, Haohao Qu, Shanru Lin, Kanglong Liu, Wenqi Fan, Qing Li
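Abstract summary: SUPERGLASSES is the first comprehensive Visual Question Answering benchmark built on real-world data collected by smart glasses devices. The proposed agent, SUPERLENS, achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.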
Reference score (independently computed attention score): 22.22405739343465
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.
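Abstract (reference translation): The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has opened new frontiers for multimodal interaction, and Visual Question Answering (VQA) over external knowledge sources has emerged as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets, but these datasets lack the variety and realism needed to reflect smart glasses usage scenarios, and they diverge from the specific challenges of those scenarios, where the object of interest must be accurately identified before any external knowledge is retrieved. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data collected entirely by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark and find significant performance gaps. To address the limitations of existing models, we propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.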

Related papers