Fugu-MT Paper Translation (Abstract): Understanding Asynchronous Inference Methods for Vision-Language-Action Models

Paper overview: Understanding Asynchronous Inference Methods for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2605.08168v1
Date: Mon, 04 May 2026 18:01:15 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.417315
Title: Understanding Asynchronous Inference Methods for Vision-Language-Action Models
Authors: Ayoub Agouzoul
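Abstract summary: Vision-Language-Action (VLA) models offer a promising path to generalist robot control, but their inference latency causes observation staleness when generated actions are executed asynchronously. Several methods have been proposed concurrently to mitigate this problem: inference-time inpainting (IT-RTC), training-time delay simulation (TT-RTC), future-state-aware conditioning (VLASH), and lightweight residual correction (A2C2). The paper presents a systematic comparison of these four methods under controlled conditions.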
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models offer a promising path to generalist robot control, but their inference latency causes observation staleness when generated actions are executed asynchronously. Several methods have been proposed concurrently to mitigate this problem: inference-time inpainting (IT-RTC), training-time delay simulation (TT-RTC), future-state-aware conditioning (VLASH), and lightweight residual correction (A2C2). Each takes a fundamentally different approach, but they have so far been evaluated independently with different codebases, base policies, and protocols. We present a systematic comparison of these four methods under controlled conditions. We develop two unified codebases that integrate all methods with harmonized library and dataset versions, and we benchmark them on the Kinetix suite with MLPMixer policies and on the LIBERO manipulation benchmark with SmolVLA, sweeping inference delays up to $d=20$ control steps. A2C2's per-step residual correction is the most effective method on Kinetix, holding above 90% solve rate up to $d=8$, and also leads on LIBERO from $d=4$ onwards. IT-RTC is competitive at low delays but degrades sharply under long chunks ($H=30$) and high delays. TT-RTC is the most robust training-based method: stable across $d_\max$ choices, generalizes beyond its training delay distribution, and adds zero inference overhead. VLASH exhibits a clear low-delay vs. high-delay trade-off governed by the fine-tuning delay range $[0,d_\max]$. Code is available at https://github.com/TheAyos/async-vla-inference
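
The "observation staleness" named in the abstract can be illustrated with a toy timing model. The sketch below is not taken from the paper or its repository. The function name `staleness_trace`, the parameter `replan_every`, and the fixed-interval scheduling rule are all assumptions made for illustration. It only shows that, when a chunk of `horizon` actions takes `delay` control steps to compute, every executed action is conditioned on an observation that is at least `delay` steps old.

```python
from typing import List, Optional


def staleness_trace(
    num_steps: int, horizon: int, delay: int, replan_every: int
) -> List[Optional[int]]:
    """Age (in control steps) of the observation behind each executed action.

    Toy model (an assumption, not the paper's protocol):
    - inference is launched at steps 0, replan_every, 2 * replan_every, ...
    - a chunk launched at step t0 is conditioned on the observation at t0,
      becomes usable at step t0 + delay, and covers steps t0 .. t0 + horizon - 1
    - at each step the controller uses the most recently launched usable chunk
    """
    ages: List[Optional[int]] = []
    for t in range(num_steps):
        age: Optional[int] = None  # None: no chunk is available yet
        for t0 in range(0, t + 1, replan_every):
            ready = t0 + delay <= t      # inference has finished
            covers = t < t0 + horizon    # chunk still contains an action for t
            if ready and covers:
                age = t - t0             # later launches overwrite earlier ones
        ages.append(age)
    return ages


if __name__ == "__main__":
    for d in (0, 2):
        print(f"d={d}:", staleness_trace(8, horizon=8, delay=d, replan_every=4))
```

In this model the smallest observation age ever reached equals `delay`, which is the gap that the four compared methods address in different ways.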
