Fugu-MT Paper Translation (Abstract): Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models

Paper summary: Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2602.01166v1
Date: Sun, 01 Feb 2026 11:34:37 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.064085
Title: Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models
Title (reference translation): Latent Reasoning VLA: Latent Thinking and Prediction in Vision-Language-Action Models
Authors: Shuanghao Bai, Jing Lyu, Wanqi Zhou, Zhe Li, Dakai Wang, Lei Xing, Xiaoguang Zhao, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Badong Chen, Shanghang Zhang
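Abstract summary: Vision-Language-Action (VLA) models benefit from chain-of-thought (CoT) reasoning, but existing approaches incur high inference overhead. This paper proposes Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multi-modal CoT reasoning into continuous latent representations for embodied action.
Reference score (site-computed attention metric): 69.58413440457828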
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models benefit from chain-of-thought (CoT) reasoning, but existing approaches incur high inference overhead and rely on discrete reasoning representations that mismatch continuous perception and control. We propose Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multi-modal CoT reasoning into continuous latent representations for embodied action. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control. To realize latent embodied reasoning, we introduce a curriculum-based training paradigm that progressively transitions from explicit textual and visual CoT supervision to latent reasoning, and finally adapts latent reasoning dynamics to condition action generation. We construct two structured CoT datasets and evaluate LaRA-VLA on both simulation benchmarks and long-horizon real-robot manipulation tasks. Experimental results show that LaRA-VLA consistently outperforms state-of-the-art VLA methods while reducing inference latency by up to 90% compared to explicit CoT-based approaches, demonstrating latent reasoning as an effective and efficient paradigm for real-time embodied control. Project Page: https://loveju1y.github.io/Latent-Reasoning-VLA/ (LaRA-VLA Website).
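Abstract (reference translation): Vision-Language-Action (VLA) models benefit from chain-of-thought (CoT) reasoning, but existing approaches incur high inference overhead and depend on discrete reasoning representations that are mismatched with continuous perception and control. This paper proposes Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multi-modal CoT reasoning into continuous latent representations. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control. To realize latent embodied reasoning, the authors introduce a curriculum-based training paradigm that moves step by step from explicit textual and visual CoT supervision to latent reasoning, and finally adapts the latent reasoning dynamics to condition action generation. They construct two structured CoT datasets and evaluate LaRA-VLA on both simulation benchmarks and long-horizon real-robot manipulation tasks. The experimental results show that LaRA-VLA outperforms state-of-the-art VLA methods while reducing inference latency by up to 90%, demonstrating latent reasoning as an effective and efficient paradigm for real-time embodied control. Project page: https://loveju1y.github.io/Latent-Reasoning-VLA/ (LaRA-VLA Website)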

Related papers