Fugu-MT Paper Translation (Summary): Intend, Reflect, Refine: An Adaptive Multimodal Reflection Framework for Autonomous Driving

Paper summary: Intend, Reflect, Refine: An Adaptive Multimodal Reflection Framework for Autonomous Driving

arxiv url: http://arxiv.org/abs/2606.22913v1
Date: Mon, 22 Jun 2026 06:53:58 GMT
Status: Translation complete
Last updated in system: 2026-06-25 03:39:01.9006
Title: Intend, Reflect, Refine: An Adaptive Multimodal Reflection Framework for Autonomous Driving
Title (reference translation): Intend, Reflect, Refine: An Adaptive Multimodal Reflection Framework for Autonomous Driving
Authors: Zisheng Chen, Yuping Qiu, Jianhua Han, Tao Tang, Xiuwei Chen, Likui Zhang, Ying-Cong Chen, Hang Xu, Xiaodan Liang
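Abstract summary: We propose IRR-Drive, an adaptive multimodal reflection framework for autonomous driving. IRR-Drive explicitly models the anticipated evolution of the scene, enabling the model to rigorously self-correct and refine its initial intent. The proposed method achieves state-of-the-art performance on the NAVSIM benchmark in both PDMS and EPDMS.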
Reference score (independently computed attention score): 79.71689985927625
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent Vision-Language-Action (VLA) models have advanced end-to-end autonomous driving by incorporating reasoning for better interpretability and planning quality. However, most existing approaches directly generate the final trajectory without explicitly examining its future consequences, which limits their reliability in complex and dynamic environments. To address this limitation, we propose IRR-Drive (Intend, Reflect, Refine), an adaptive multimodal reflection framework for autonomous driving. Specifically, to tightly couple high-level reasoning with physical constraints, IRR-Drive first generates a preliminary textual intention and anticipates potential interactions by predicting future semantic bird's-eye view (BEV) representations. This dual-modality (Text + BEV) reflection space explicitly models anticipated scene evolution, enabling the model to rigorously self-correct and refine its initial intent before generating the final trajectory. Furthermore, to balance planning performance and computational efficiency, we construct reflection-oriented training data and design an adaptive reflection reward, enabling the model to adaptively select its reasoning mode according to scene complexity. Instead of using reasoning primarily as an auxiliary interpretation, IRR-Drive directly integrates an adaptive reflection mechanism into the planning framework, enabling grounded, decision-aware trajectory correction that is driven by scene complexity. Our method achieves state-of-the-art performance on the NAVSIM benchmark in both PDMS and EPDMS. Extensive experiments demonstrate the effectiveness of our multimodal reflection framework and validate the efficacy of the proposed adaptive reflection strategy.
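Abstract (reference translation): Recent Vision-Language-Action (VLA) models have advanced end-to-end autonomous driving by incorporating reasoning to improve interpretability and planning quality. However, most existing approaches generate the final trajectory directly, without explicitly examining its future consequences, which limits their reliability in complex and dynamic environments. To address this limitation, we propose IRR-Drive (Intend, Reflect, Refine), an adaptive multimodal reflection framework for autonomous driving. Specifically, to tightly couple high-level reasoning with physical constraints, IRR-Drive first generates a preliminary textual intention and then anticipates potential interactions by predicting future semantic bird's-eye view (BEV) representations. This dual-modality (Text + BEV) reflection space explicitly models the anticipated scene evolution, allowing the model to rigorously self-correct and refine its initial intent before generating the final trajectory. Furthermore, to balance planning performance and computational efficiency, we construct reflection-oriented training data and design an adaptive reflection reward, so that the model can adaptively select its reasoning mode according to scene complexity. Rather than using reasoning primarily as an auxiliary interpretation, IRR-Drive integrates an adaptive reflection mechanism directly into the planning framework, enabling grounded, decision-aware trajectory correction driven by scene complexity. Our method achieves state-of-the-art performance on the NAVSIM benchmark in both PDMS and EPDMS. Extensive experiments demonstrate the effectiveness of the multimodal reflection framework and validate the proposed adaptive reflection strategy.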
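
The intend, reflect, refine flow described in the abstract can be pictured with a minimal control-flow sketch. This is an illustration under stated assumptions, not the authors' implementation: every name here (`irr_plan`, `is_complex`, the `toy_*` stubs, the scene dictionary keys) is hypothetical, and the hand-written `is_complex` gate only stands in for the reasoning-mode selection that, per the abstract, the model itself learns through the adaptive reflection reward.

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints
BevGrid = List[List[int]]               # toy semantic BEV grid (hypothetical encoding)


def irr_plan(
    scene: dict,
    intend: Callable[[dict], str],
    is_complex: Callable[[dict], bool],
    predict_bev: Callable[[dict, str], BevGrid],
    refine: Callable[[str, BevGrid], str],
    decode: Callable[[dict, str], Trajectory],
) -> Tuple[str, bool, Trajectory]:
    """Return (final intention, whether reflection ran, trajectory)."""
    # Intend: produce a preliminary textual intention.
    intention = intend(scene)
    # Adaptive gate (assumption): reflect only when the scene looks complex.
    reflected = is_complex(scene)
    if reflected:
        # Reflect: anticipate scene evolution as a future semantic BEV.
        future_bev = predict_bev(scene, intention)
        # Refine: revise the intention in light of the predicted future.
        intention = refine(intention, future_bev)
    # Decode the final trajectory from the (possibly refined) intention.
    return intention, reflected, decode(scene, intention)


# Toy stand-ins for the learned components (all hypothetical).
def toy_intend(scene: dict) -> str:
    return "keep lane"


def toy_is_complex(scene: dict) -> bool:
    return scene["num_agents"] > 3


def toy_predict_bev(scene: dict, intention: str) -> BevGrid:
    # 1 marks a cell predicted to be occupied ahead of the ego vehicle.
    return [[1 if scene["lead_vehicle_braking"] else 0]]


def toy_refine(intention: str, bev: BevGrid) -> str:
    return "slow down" if any(any(row) for row in bev) else intention


def toy_decode(scene: dict, intention: str) -> Trajectory:
    step = 1.0 if intention == "slow down" else 2.0
    return [(step * i, 0.0) for i in range(1, 4)]
```

Calling `irr_plan` with the toy stubs on a crowded scene runs the reflection branch and revises the intention, while a sparse scene skips reflection and decodes the preliminary intention directly, which mirrors the trade-off between planning performance and computational cost that the abstract mentions.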
