Fugu-MT Paper Translation (Abstract): Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception

Paper summary: Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception

arxiv url: http://arxiv.org/abs/2510.08352v1
Date: Thu, 09 Oct 2025 15:38:41 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.169229
Title: Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
Title (reference translation): Evaluation of Small Vision-Language Models on Distance-Dependent Traffic Perception
Authors: Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising
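Abstract summary: This paper introduces Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes. Several state-of-the-art (SOTA) small Vision-Language Models (VLMs) are evaluated on DTPQA.
Reference score (independently computed attention score): 0.7644902597398215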
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
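Abstract (reference translation): Vision-Language Models (VLMs) are becoming increasingly powerful and show strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first have a reliable perception system. Moreover, because critical objects and agents in traffic scenes are often far away, systems are needed that are not "shortsighted", that is, systems with strong perception at both close range (up to 20 meters) and long range (30+ meters). With this in mind, the paper introduces Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. Excluding questions that require reasoning ensures that model performance reflects perception capabilities alone. Because automated driving hardware has limited processing power and cannot support large VLMs, the study centers on smaller VLMs. More specifically, several state-of-the-art (SOTA) small VLMs are evaluated on DTPQA and shown to significantly underperform humans despite the simplicity of the questions. However, the human sample size was relatively small, which imposes statistical limitations. Specific perception tasks, such as distinguishing left from right, are also identified as remaining particularly challenging for these models.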

Related papers