Fugu-MT paper translation (abstract): All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

Paper overview: All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

arxiv url: http://arxiv.org/abs/2510.26641v1
Date: Thu, 30 Oct 2025 16:08:25 GMT
Status: translation complete
Last updated in system: 2025-10-31 16:05:09.898086
Title: All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Title (reference translation): What Object Detection Needs: From Pixels, Points, and Prompts to Next-Generation Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli, Abolfazl Razi
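Abstract summary: Autonomous vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. Their success is tied to one core capability: reliable object detection in complex, multimodal environments. Recent advances in computer vision (CV) and artificial intelligence (AI) have driven remarkable progress, but knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by providing a forward-looking analysis of object detection in AVs.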
Reference score (independently computed attention score): 7.863490977061713
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
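Abstract (reference translation): Autonomous vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. Their success, however, is tied to one core capability: reliable object detection in complex, multimodal environments. Recent advances in computer vision (CV) and artificial intelligence (AI) have driven remarkable progress, yet the field still faces a critical challenge because knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap with a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as vision-language models (VLMs), large language models (LLMs), and generative AI rather than re-examining outdated techniques. We first systematically review the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that goes beyond simple data collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Finally, we analyze state-of-the-art detection methodologies, from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to transformer-driven approaches powered by Vision Transformers (ViTs), large and small language models (SLMs), and VLMs. By synthesizing these perspectives, our survey provides a clear roadmap of current capabilities, open challenges, and future opportunities.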

Related papers