Fugu-MT Paper Translation (Abstract): TBD-VLA: Temporal Block Diffusion Vision Language Action Model

Paper overview: TBD-VLA: Temporal Block Diffusion Vision Language Action Model

arxiv url: http://arxiv.org/abs/2606.07895v1
Date: Fri, 05 Jun 2026 23:10:43 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.519119
Title: TBD-VLA: Temporal Block Diffusion Vision Language Action Model
Title (reference translation): TBD-VLA: Temporal Block Diffusion Vision Language Action Model
Authors: Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo
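Abstract summary: We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed.
Reference score (independently computed attention score): 7.861095039299131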
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
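Abstract (reference translation): Discrete Vision-Language-Action (VLA) models usually formulate action generation as next-token prediction over a discretized action space, conditioning each token autoregressively on the preceding context. Although effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent work introduces parallel decoding to improve efficiency and speed up inference, but lacks an explicit mechanism for modeling token dependencies. This paper introduces TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion. Action sequences are partitioned into temporal blocks, and masked discrete diffusion is performed within each block while autoregressive generation is maintained across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches on both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/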
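
The decoding scheme the abstract describes (masked discrete diffusion inside each temporal block, autoregression across blocks) can be illustrated with a toy sketch. This is not the authors' implementation: the `MASK` sentinel, the random `toy_denoiser` standing in for the model, the confidence-based unmasking schedule, and all function and parameter names are assumptions made for illustration only.

```python
import random

MASK = -1  # hypothetical sentinel for a masked action token


def toy_denoiser(context, block, vocab_size, rng):
    """Stand-in for the model (assumption): returns {slot: (token, confidence)}
    for every masked slot. A real model would condition on images, language,
    and `context`; this toy version ignores them and samples at random."""
    return {i: (rng.randrange(vocab_size), rng.random())
            for i, tok in enumerate(block) if tok == MASK}


def decode_block(context, block_len, vocab_size, steps, rng):
    """Masked discrete diffusion within one block: start fully masked and
    commit the most confident predictions over a fixed number of steps."""
    block = [MASK] * block_len
    for step in range(steps):
        preds = toy_denoiser(context, block, vocab_size, rng)
        if not preds:  # every slot is already filled
            break
        # Commit an even share of the remaining slots; the last step takes all.
        k = -(-len(preds) // (steps - step))  # ceiling division
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            block[i] = preds[i][0]
    return block


def generate(num_blocks, block_len, vocab_size=256, steps=4, seed=0):
    """Autoregressive across blocks: each new block sees all earlier blocks."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(num_blocks):
        context = list(sequence)
        sequence.extend(decode_block(context, block_len, vocab_size, steps, rng))
    return sequence


actions = generate(num_blocks=3, block_len=8)
print(len(actions))
```

With `steps` smaller than `block_len`, several tokens are committed per model call, which is where the speedup over token-by-token decoding would come from. Under this reading, temporal in-painting would amount to starting a block with some slots already filled by committed tokens instead of `MASK`; that is an interpretation of the abstract, not a detail it confirms.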

Related papers