Fugu-MT Paper Translation (Abstract): A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

Paper overview: A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

arxiv url: http://arxiv.org/abs/2604.05672v2
Date: Wed, 08 Apr 2026 08:24:30 GMT
Status: Translation complete
Last updated in system: 2026-04-09 14:06:05.101835
Title: A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
Title (reference translation): A1: A Fully Transparent, Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
Authors: Kaidong Zhang, Jian Zhang, Rongtao Xu, Yu Sun, Shuoshuo Xue, Youpeng Wen, Xiaoyu Guo, Minghao Guo, Weijia Liufu, Liu Zihou, Kangyi Ji, Yangsong Zhang, Jiarun Zhu, Jingzhi Liu, Zihang Li, Ruiyi Chen, Meng Cao, Jingming Zhang, Shen Zhao, Xiaojun Chang, Feng Zheng, Ivan Laptev, Xiaodan Liang
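Abstract summary: Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost. We present A1, a fully open-source and transparent VLA framework designed for low-cost, high-throughput inference. A1 achieves state-of-the-art success rates while significantly reducing inference cost.
Reference score (independently computed attention score): 112.9420001646428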
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost: billion-scale VLM backbones and iterative diffusion/flow-based action heads incur high latency and compute, making real-time control expensive on commodity hardware. We present A1, a fully open-source and transparent VLA framework designed for low-cost, high-throughput inference without sacrificing manipulation success. Our approach leverages pretrained VLMs that provide implicit affordance priors for action generation. We release the full training stack (training code, data/data-processing pipeline, intermediate checkpoints, and evaluation scripts) to enable end-to-end reproducibility. Beyond optimizing the VLM alone, A1 targets the full inference pipeline by introducing a budget-aware adaptive inference scheme that jointly accelerates the backbone and the action head. Specifically, we monitor action consistency across intermediate VLM layers to trigger early termination, and propose Inter-Layer Truncated Flow Matching that warm-starts denoising across layers, enabling accurate actions with substantially fewer effective denoising iterations. Across simulation benchmarks (LIBERO, VLABench) and real robots (Franka, AgiBot), A1 achieves state-of-the-art success rates while significantly reducing inference cost (e.g., up to 72% lower per-episode latency for flow-matching inference and up to 76.6% backbone computation reduction with minor performance degradation). On RoboChallenge, A1 achieves an average success rate of 29.00%, outperforming baselines including pi0 (28.33%), X-VLA (21.33%), and RDT-1B (15.00%).
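Abstract (reference translation): Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost. We present A1, a fully open-source and transparent VLA framework that enables low-cost, high-throughput inference without sacrificing manipulation success. We release the full training stack (training code, data/data-processing pipeline, intermediate checkpoints, and evaluation scripts) to enable end-to-end reproducibility. Beyond optimizing the VLM alone, A1 targets the full inference pipeline by introducing a budget-aware adaptive inference scheme that jointly accelerates the backbone and the action head. Specifically, we monitor action consistency across intermediate VLM layers to trigger early termination, and propose Inter-Layer Truncated Flow Matching, which warm-starts denoising across layers. Across simulation benchmarks (LIBERO, VLABench) and real robots (Franka, AgiBot), A1 achieves state-of-the-art success rates while significantly reducing inference cost (e.g., up to 72% lower per-episode latency for flow-matching inference and up to 76.6% less backbone computation with minor performance degradation). On RoboChallenge, A1 achieves an average success rate of 29.00%, outperforming baselines including pi0 (28.33%), X-VLA (21.33%), and RDT-1B (15.00%).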
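
The early-termination and warm-start ideas described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the per-layer velocity fields (make_toy_velocity), the Euclidean consistency check, the threshold, the step counts, and the warm-start time are all hypothetical. Each "layer" is stood in for by a straight-line flow toward a fixed target action. The first layer denoises from noise with a full step budget, and later layers start from the previous layer's action and run only a few steps over the truncated interval.

```python
import math
import random


def action_distance(a, b):
    """Euclidean distance between two action vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def integrate(action, velocity_fn, t_start, num_steps):
    """Euler-integrate da/dt = velocity_fn(a, t) from t_start up to t = 1."""
    dt = (1.0 - t_start) / num_steps
    for i in range(num_steps):
        t = t_start + i * dt
        v = velocity_fn(action, t)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


def make_toy_velocity(target):
    """Hypothetical stand-in for a learned, layer-conditioned velocity field."""
    def velocity(action, t):
        # Straight-line flow that reaches `target` at t = 1.
        return [(g - a) / (1.0 - t) for a, g in zip(action, target)]
    return velocity


def adaptive_inference(layer_velocity_fns, action_dim, full_steps=10,
                       warm_steps=2, warm_start_time=0.8,
                       threshold=0.05, seed=0):
    """Return (action, exit_layer_index, total_euler_steps)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    prev = None
    total_steps = 0
    action = noise
    for layer_idx, velocity_fn in enumerate(layer_velocity_fns):
        if prev is None:
            # First layer: full denoising from pure noise.
            action = integrate(noise, velocity_fn, 0.0, full_steps)
            total_steps += full_steps
        else:
            # Later layers: warm-start from the previous layer's action
            # and integrate only the truncated interval [warm_start_time, 1].
            action = integrate(prev, velocity_fn, warm_start_time, warm_steps)
            total_steps += warm_steps
            # Early termination once consecutive layers agree.
            if action_distance(action, prev) < threshold:
                return action, layer_idx, total_steps
        prev = action
    return action, len(layer_velocity_fns) - 1, total_steps


# Made-up per-layer target actions that settle as depth increases.
LAYER_TARGETS = [[1.0, 0.0], [0.6, 0.3], [0.52, 0.38],
                 [0.5, 0.4], [0.5, 0.4], [0.5, 0.4]]

if __name__ == "__main__":
    fns = [make_toy_velocity(t) for t in LAYER_TARGETS]
    action, exit_layer, steps = adaptive_inference(fns, action_dim=2)
    print(exit_layer, steps, action)
```

With these made-up targets, the loop stops at layer index 3 after 16 Euler steps, whereas running the full 10-step budget at each of the four visited layers would take 40. The numbers only show the mechanism and say nothing about the paper's reported savings.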

Related papers