Fugu-MT Paper Translation (Summary): DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Paper summary: DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2603.22280v1
Date: Mon, 23 Mar 2026 17:59:25 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.838514
Title: DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models
Title (reference translation): DualCoT-VLA: A visual-linguistic chain of thought through parallel reasoning for vision-language-action models
Authors: Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li
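Abstract summary: Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to give VLA models a "thinking before acting" capability. The authors propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism.
Reference score (attention score computed independently by Fugu-MT): 50.07453075750711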
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.
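Abstract (reference translation): Vision-Language-Action (VLA) models map visual observations and language instructions directly to robot actions. Although effective for simple tasks, standard VLA models often struggle with complex multi-step tasks that require logical planning and with precise manipulation that requires fine-grained spatial perception. Recent work has incorporated Chain-of-Thought (CoT) reasoning to give VLA models the ability to "think before acting". However, current CoT-based VLA models face two limitations: 1) they cannot capture low-level visual details and high-level logical planning at the same time, because they rely on isolated, single-modal CoT; 2) they incur high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. It integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as on real-world platforms.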

Related papers