Fugu-MT Paper Translation (Abstract): StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

Paper overview: StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

arxiv url: http://arxiv.org/abs/2603.28565v1
Date: Mon, 30 Mar 2026 15:23:27 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.474909
Title: StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation
Title (reference translation): StreamingVLA: Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation
Authors: Yiran Shi, Dongqi Guo, Tianchen Zhao, Feng Gao, Liangzhi Shi, Chao Yu, ZhiJian Mo, Qihua Xiao, XiaoShuai Peng, Qingmin Liao, Yu Wang
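Abstract summary: Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. The high computational cost of VLA models poses significant efficiency challenges. This paper proposes enabling VLAs to parallelize asynchronously across VLA stages.
Reference score (independently computed attention score): 30.881585159777714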
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, since the different stages of a VLA (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs to asynchronously parallelize across VLA stages in a "streaming" manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions. This overlaps the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, thereby overlapping the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves substantial speedup and improves the fluency of execution: a 2.4$\times$ latency speedup and a 6.5$\times$ reduction in execution halting.
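Abstract (reference translation): Vision-language-action (VLA) models have shown exceptional performance in perception and control driven by natural language. However, the high computational cost of VLA models poses major efficiency challenges, especially on resource-constrained edge platforms in real-world deployments. In addition, because the different stages of a VLA (observation, action generation, and execution) must proceed in order and wait for the previous stage to finish, the system suffers from frequent halting and high latency. To address this problem, the authors carry out a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs to parallelize asynchronously across VLA stages in a streaming manner. First, they remove the dependence on action chunking and adopt action flow matching, so that the latencies of action generation and execution overlap. Second, they design an adaptive observation mechanism that accounts for action saliency, so that the latencies of execution and observation overlap. Without sacrificing performance, StreamingVLA achieves a substantial speedup and improves the fluency of execution. It achieves a 2.4$\times$ latency speedup and reduces execution halting by 6.5$\times$.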

Related papers