Fugu-MT Paper Translation (Abstract): OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving

Paper overview: OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving

arxiv url: http://arxiv.org/abs/2509.19973v2
Date: Thu, 25 Sep 2025 06:33:06 GMT
Status: Translation complete
Last updated in system: 2025-09-26 12:02:33.945817
Title: OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving
Title (reference translation): OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving
Authors: Pei Liu, Hongliang Lu, Haichao Liu, Haipeng Liu, Xin Liu, Ruoyu Yao, Shengbo Eben Li, Jun Ma
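Abstract summary: Human vision can transform two-dimensional observations into an egocentric three-dimensional scene understanding. We propose a novel human-like framework called OmniScene, which integrates multi-view and temporal perception for holistic 4D scene understanding. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.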
Reference score (independently computed attention score): 21.143038784114154
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.
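Abstract (reference translation): Human vision can transform two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. Current autonomous driving systems, however, lack this capability: mainstream approaches rely primarily on depth-based 3D reconstruction rather than scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM). Next, using a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features, which provides semantic supervision, enriches feature learning, and explicitly captures human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) that addresses imbalances in modality contributions during multimodal integration. It adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from the visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective use of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, comparing it against more than ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.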

Related papers