Fugu-MT Paper Translation (Abstract): DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference

Paper summary: DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference

arxiv url: http://arxiv.org/abs/2603.10469v1
Date: Wed, 11 Mar 2026 06:40:44 GMT
Status: translation complete
Last updated in system: 2026-03-21 18:33:56.669077
Title: DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference
Title (reference translation): DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference
Authors: Yuquan Li, Lianjie Ma, Han Ding, Lijun Zhu
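Abstract summary: Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. DepthCache is a training-free framework that uses depth as a structural prior for visual token compression. On the LIBERO benchmark, DepthCache achieves up to a 1.28x inference speedup.
Reference score (independently computed attention score): 5.305950698447464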
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by large language backbones. Existing methods either prune or merge tokens uniformly, degrading the spatial reasoning essential for robotic control. We present DepthCache, a training-free framework that leverages depth as a structural prior for visual token compression. It partitions observations into depth-based regions and applies spatially differentiated merge ratios, preserving the near-field workspace while compressing the distant background. To exploit temporal redundancy, DepthCache distributes the merging process across consecutive frames, ensuring consistent representations while reducing per-step computation. A motion-adaptive pipeline further optimizes auxiliary view compression based on end-effector dynamics. The framework requires no model modification, generalizing across diverse VLA architectures. On the LIBERO benchmark, DepthCache achieves up to 1.28x inference speedup with less than 1% average success rate degradation across three VLA models (pi_0.5, OpenVLA, GR00T), whereas pruning and merging baselines incur 4–24% degradation at comparable compression. Real-world experiments on a physical manipulator demonstrate that DepthCache enables faster task throughput and more responsive closed-loop control in latency-sensitive scenarios.
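Abstract (reference translation): Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by large language backbones. Existing methods prune or merge tokens uniformly, which degrades the spatial reasoning essential for robotic control. DepthCache is a training-free framework that uses depth as a structural prior for visual token compression. It partitions observations into depth-based regions and applies spatially differentiated merge ratios, preserving the near-field workspace while compressing the distant background. To exploit temporal redundancy, DepthCache distributes the merging process across consecutive frames, ensuring consistent representations while reducing per-step computation. A motion-adaptive pipeline further optimizes auxiliary view compression based on end-effector dynamics. The framework requires no model modification and generalizes across diverse VLA architectures. On the LIBERO benchmark, DepthCache achieves up to a 1.28x inference speedup while keeping the average success rate degradation below 1% across three VLA models (pi_0.5, OpenVLA, GR00T). Real-world experiments on a physical manipulator show that DepthCache enables faster task throughput and more responsive closed-loop control in latency-sensitive scenarios.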
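
The idea of depth-based regions with spatially differentiated merge ratios can be illustrated with a small sketch. This is not the authors' implementation: the function name `depth_guided_merge`, the median depth threshold, the two-region split, and the group-averaging merge rule are all assumptions chosen for illustration, and the temporal and motion-adaptive parts of the method are not modeled.

```python
from statistics import median


def depth_guided_merge(tokens, depths, far_group=4):
    """Illustrative depth-guided token merging (hypothetical, not the paper's code).

    tokens: list of equal-length float lists, one feature vector per visual token.
    depths: list of per-token depth values (smaller means closer to the camera).
    far_group: assumed merge ratio for the far region (tokens averaged per group).
    """
    if len(tokens) != len(depths):
        raise ValueError("tokens and depths must have the same length")
    if far_group < 1:
        raise ValueError("far_group must be at least 1")
    if not tokens:
        return []

    # Assumption: split the view into a near and a far region at the median depth.
    threshold = median(depths)

    # Near-field tokens (the workspace) are kept unmerged.
    near = [t for t, d in zip(tokens, depths) if d <= threshold]

    # Far tokens are sorted by depth so each group averages tokens at similar depth.
    far = sorted((d, i) for i, d in enumerate(depths) if d > threshold)

    merged = []
    for start in range(0, len(far), far_group):
        group = [tokens[i] for _, i in far[start:start + far_group]]
        # Average the group feature-wise into a single token.
        merged.append([sum(col) / len(group) for col in zip(*group)])

    return near + merged


if __name__ == "__main__":
    demo_tokens = [[float(i), 1.0] for i in range(8)]
    demo_depths = list(range(8))
    out = depth_guided_merge(demo_tokens, demo_depths, far_group=2)
    print(len(demo_tokens), "tokens reduced to", len(out))
```

In this sketch the near half of the tokens passes through untouched while the far half is compressed by a factor of `far_group`, which mirrors the abstract's description of preserving the near-field workspace and compressing the distant background. A real system would need per-token depth aligned to the vision encoder's patch grid, which is assumed to be given here.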

Related papers