Fugu-MT Paper Translation (Abstract): TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Paper abstract: TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

arxiv url: http://arxiv.org/abs/2606.06491v1
Date: Thu, 04 Jun 2026 17:59:40 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:45.036959
Title: TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
Title (reference translation): TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
Authors: Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding
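Abstract summary: Existing Vision-Language-Action models (VLAs) inherit only a single fixed speed from training demonstrations. We observe that the magnitude of each predicted action already governs how fast the robot moves. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition.
Reference score (independently computed attention metric): 58.40033352838586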
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) inherit only a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times a demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
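Abstract (reference translation): Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that call for slow, precise motion. However, existing Vision-Language-Action models (VLAs) inherit only a single fixed speed from training demonstrations. Earlier efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, which opens a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times a demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks show that TempoVLA achieves flexible speed control in both directions, while VSTA also improves the default $1\times$ performance through better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.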
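
The abstract describes VSTA only at a high level: a demonstration is re-timed "by merging or splitting actions". The following is a minimal sketch of that idea, not the authors' implementation. It assumes that each action is an additive end-effector displacement (a delta action) and that the speed factor is an integer; the function names `speed_up`, `slow_down`, and `total_displacement` are hypothetical. Gripper commands, rotations, and non-integer speeds would need separate handling that the abstract does not specify.

```python
from typing import List

# Assumption: one action is an additive displacement, e.g. [dx, dy].
Action = List[float]


def speed_up(actions: List[Action], k: int) -> List[Action]:
    """Merge every k consecutive delta actions into one by summing them.

    The merged trajectory has roughly len(actions) / k steps, so the robot
    covers the same path in fewer, larger steps (k-times faster).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    merged = []
    for start in range(0, len(actions), k):
        group = actions[start:start + k]
        merged.append([sum(values) for values in zip(*group)])
    return merged


def slow_down(actions: List[Action], k: int) -> List[Action]:
    """Split each delta action into k equal parts (k-times slower)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [[value / k for value in action]
            for action in actions
            for _ in range(k)]


def total_displacement(actions: List[Action]) -> List[float]:
    """Sum all deltas; re-timing should leave this unchanged."""
    return [sum(values) for values in zip(*actions)]


# Hypothetical 2-D demonstration with four delta actions.
demo = [[1.0, 0.0], [1.0, 2.0], [0.0, 2.0], [2.0, 0.0]]
fast = speed_up(demo, 2)   # 2 larger steps
slow = slow_down(demo, 2)  # 8 smaller steps
```

In this sketch all three trajectories end at the same point, which is one simple way to read "preserving its motion semantics": only the number and size of the steps change, not the net motion.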
