Fugu-MT Paper Translation (Abstract): Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning

Paper overview: Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning

arxiv url: http://arxiv.org/abs/2512.24426v1
Date: Tue, 30 Dec 2025 19:04:17 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.57805
Title: Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning
Title (reference translation): Counterfactual VLA: A Self-Reflective Vision-Language-Action Model Using Adaptive Reasoning
Authors: Zhenghao "Mark" Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, Boyi Li, Bolei Zhou, Yan Wang, Marco Pavone
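Abstract summary: This work introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. CF-VLA first generates time-segmented meta-actions that summarize driving intent, and then performs counterfactual reasoning conditioned on both the meta-actions and the visual context. In experiments on large-scale driving datasets, CF-VLA improves trajectory accuracy by up to 17.6%, enhances safety metrics by 20.5%, and exhibits adaptive thinking.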
Reference score (independently computed attention score): 71.19675094463834
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent reasoning-augmented Vision-Language-Action (VLA) models have improved the interpretability of end-to-end autonomous driving by generating intermediate reasoning traces. Yet these models primarily describe what they perceive and intend to do, rarely questioning whether their planned actions are safe or appropriate. This work introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. CF-VLA first generates time-segmented meta-actions that summarize driving intent, and then performs counterfactual reasoning conditioned on both the meta-actions and the visual context. This step simulates potential outcomes, identifies unsafe behaviors, and outputs corrected meta-actions that guide the final trajectory generation. To efficiently obtain such self-reflective capabilities, we propose a rollout-filter-label pipeline that mines high-value scenes from a base (non-counterfactual) VLA's rollouts and labels counterfactual reasoning traces for subsequent training rounds. Experiments on large-scale driving datasets show that CF-VLA improves trajectory accuracy by up to 17.6%, enhances safety metrics by 20.5%, and exhibits adaptive thinking: it only enables counterfactual reasoning in challenging scenarios. By transforming reasoning traces from one-shot descriptions to causal self-correction signals, CF-VLA takes a step toward self-reflective autonomous driving agents that learn to think before they act.
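Abstract (reference translation): Recent reasoning-augmented VLA models have improved the interpretability of end-to-end autonomous driving by generating intermediate reasoning traces. However, these models mainly describe what they perceive and what they intend to do, and rarely question whether their planned actions are safe. This work introduces CF-VLA (Counterfactual VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. CF-VLA first generates time-segmented meta-actions that summarize driving intent, and then performs counterfactual reasoning conditioned on both the meta-actions and the visual context. This step simulates potential outcomes, identifies unsafe behaviors, and outputs corrected meta-actions that guide the final trajectory generation. To obtain such self-reflective capabilities efficiently, the authors propose a rollout-filter-label pipeline that mines high-value scenes from the rollouts of a base (non-counterfactual) VLA and labels counterfactual reasoning traces for subsequent training rounds. In experiments on large-scale driving datasets, CF-VLA improves trajectory accuracy by up to 17.6%, enhances safety metrics by 20.5%, and exhibits adaptive thinking, enabling counterfactual reasoning only in challenging scenarios. By turning reasoning traces from one-shot descriptions into causal self-correction signals, CF-VLA takes a step toward self-reflective autonomous driving agents that learn to think before they act.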
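
The inference flow described in the abstract (propose meta-actions, reflect only when needed, then generate the trajectory) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify any API, so `MetaAction`, `cf_vla_step`, the four callables, and the toy driving logic are all hypothetical names and assumptions made for illustration. In particular, the abstract suggests a single model decides when to reason, whereas this sketch separates that decision into an explicit `needs_reflection` callable for readability.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class MetaAction:
    """Hypothetical time-segmented meta-action: an intent over [start_s, end_s]."""
    start_s: float
    end_s: float
    intent: str


Plan = List[MetaAction]
Trajectory = List[Tuple[float, float]]  # (time in s, 1-D position in m)


def cf_vla_step(
    propose: Callable[[Dict], Plan],
    needs_reflection: Callable[[Dict, Plan], bool],
    reflect: Callable[[Dict, Plan], Plan],
    decode: Callable[[Dict, Plan], Trajectory],
    observation: Dict,
) -> Tuple[Trajectory, Plan, bool]:
    """Propose a plan, optionally revise it, then decode a trajectory."""
    plan = propose(observation)
    # Adaptive reasoning: the reflection step runs only when it is flagged.
    reflected = needs_reflection(observation, plan)
    if reflected:
        plan = reflect(observation, plan)
    return decode(observation, plan), plan, reflected


# Toy stand-ins for the model stages (assumptions, not from the paper).
def toy_propose(obs: Dict) -> Plan:
    return [MetaAction(0.0, 2.0, "accelerate"), MetaAction(2.0, 4.0, "keep_speed")]


def toy_needs_reflection(obs: Dict, plan: Plan) -> bool:
    # Flag the plan when it accelerates while a pedestrian is ahead.
    return bool(obs["pedestrian_ahead"]) and any(m.intent == "accelerate" for m in plan)


def toy_reflect(obs: Dict, plan: Plan) -> Plan:
    # Corrected meta-actions: replace the unsafe intent, keep the time segments.
    return [
        MetaAction(m.start_s, m.end_s, "decelerate" if m.intent == "accelerate" else m.intent)
        for m in plan
    ]


def toy_decode(obs: Dict, plan: Plan) -> Trajectory:
    # Integrate a toy constant speed per segment into 1-D waypoints.
    speed = {"accelerate": 2.0, "keep_speed": 1.0, "decelerate": 0.5}
    position, points = 0.0, []
    for m in plan:
        position += speed[m.intent] * (m.end_s - m.start_s)
        points.append((m.end_s, position))
    return points


def run(obs: Dict) -> Tuple[Trajectory, Plan, bool]:
    return cf_vla_step(toy_propose, toy_needs_reflection, toy_reflect, toy_decode, obs)
```

Calling `run({"pedestrian_ahead": True})` takes the reflection branch and decodes the corrected plan, while `run({"pedestrian_ahead": False})` skips reflection and decodes the original plan unchanged.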

Related papers