Fugu-MT Paper Translation (Abstract): V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

Paper overview: V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

arxiv url: http://arxiv.org/abs/2604.02710v1
Date: Fri, 03 Apr 2026 04:07:35 GMT
Status: Translation complete
System update date: 2026-04-06 17:20:24.317295
Title: V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views
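Title (reference translation): V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models
Authors: Junwei You, Pei Li, Zhuoyu Jiang, Weizhe Tang, Zilin Huang, Rui Gan, Jiaxi Liu, Yan Zhao, Sikai Chen, Bin Ran
Abstract summary: V2X-QA is a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. The results show that viewpoint accessibility substantially affects performance and that infrastructure-side reasoning supports meaningful macroscopic traffic understanding. V2X-MoE is a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts.
Reference score (independently computed attention score): 22.24590004859344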
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego-centric and therefore cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. In this work, we introduce V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent capabilities. Benchmark results across ten representative state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance, and infrastructure-side reasoning supports meaningful macroscopic traffic understanding. Results also indicate that cooperative reasoning remains challenging since it requires cross-view alignment and evidence integration rather than simply additional visual input. To address these challenges, we introduce V2X-MoE, a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts. The strong performance of V2X-MoE further suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving. Overall, V2X-QA provides a foundation for studying multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available at: https://github.com/junwei0001/V2X-QA.
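The view-decoupled protocol described in the abstract reports results separately for vehicle-only, infrastructure-only, and cooperative conditions within one MCQA framework. The following is a minimal sketch of per-view accuracy scoring under that reading. The record fields (`view`, `answer`, `prediction`), the view labels, and the sample records are all hypothetical and are not taken from the V2X-QA release.

```python
from collections import defaultdict

# Hypothetical view labels; the dataset's actual schema may differ.
VIEWS = ("vehicle", "infrastructure", "cooperative")


def per_view_accuracy(records):
    """Return {view: accuracy} for MCQA records.

    Each record is assumed to be a dict with 'view', 'answer', and
    'prediction' keys, where answers are option letters such as "A".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        view = rec["view"]
        if view not in VIEWS:
            raise ValueError(f"unknown view: {view!r}")
        total[view] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[view] += 1
    # Only report views that actually have questions.
    return {v: correct[v] / total[v] for v in VIEWS if total[v]}


# Made-up records, for illustration only.
records = [
    {"view": "vehicle", "answer": "A", "prediction": "A"},
    {"view": "vehicle", "answer": "C", "prediction": "B"},
    {"view": "infrastructure", "answer": "D", "prediction": "d"},
    {"view": "cooperative", "answer": "B", "prediction": "A"},
]
scores = per_view_accuracy(records)
```

Keeping one score per view, rather than a single pooled accuracy, is what lets a viewpoint-dependent gap show up in the results.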
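Abstract (reference translation): Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, but most existing benchmarks are ego-centric and cannot systematically assess model performance under infrastructure-centric and cooperative driving conditions. This paper introduces V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent capabilities. Benchmark results across ten state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance and that infrastructure-side reasoning supports meaningful macroscopic traffic understanding. The results also indicate that cooperative reasoning remains challenging, because it requires cross-view alignment and evidence integration rather than simply additional visual input. To address these challenges, the paper introduces V2X-MoE, a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts. The strong performance of V2X-MoE suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving. Overall, V2X-QA provides a foundation for studying multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available at https://github.com/junwei0001/V2X-QA.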
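The abstract describes V2X-MoE only as having "explicit view routing and viewpoint-specific LoRA experts" and gives no architectural detail. As a rough illustration of that idea, and not the authors' implementation, the sketch below assumes hard routing by a known view label and applies the standard LoRA update W + (alpha / r) · B A to a single toy linear layer. All names, shapes, and weights are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(base, lora_a, lora_b, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A (standard LoRA update)."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


def forward(view, x, base, experts, alpha=2.0, rank=1):
    """Explicit routing: the view label selects the expert, with no learned gate."""
    if view not in experts:
        raise KeyError(f"no expert for view {view!r}")
    expert = experts[view]
    weight = lora_weight(base, expert["A"], expert["B"], alpha, rank)
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Toy 2x2 base layer with rank-1 adapters; all values are made up.
base = [[1.0, 0.0], [0.0, 1.0]]
experts = {
    "vehicle": {"A": [[1.0, 0.0]], "B": [[0.5], [0.0]]},
    "infrastructure": {"A": [[0.0, 1.0]], "B": [[0.0], [0.5]]},
    "cooperative": {"A": [[1.0, 1.0]], "B": [[0.5], [0.5]]},
}
outputs = {view: forward(view, [1.0, 1.0], base, experts) for view in experts}
```

With hard routing the shared base weights stay fixed and only the selected adapter changes the output, so each viewpoint gets its own low-rank specialization.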

Related papers