Fugu-MT Paper Translation (Abstract): VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Paper overview: VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2604.19728v1
Date: Tue, 21 Apr 2026 17:51:51 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.911403
Title: VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
Title (reference translation): VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
Authors: Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu,
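Abstract summary: We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single stack. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. We evaluate the closed-loop policy performance of both models on LBM Eval, an open-source simulator.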
Reference score (independently computed attention score): 11.774960393195052
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline, and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that outperforms our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.
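Abstract (reference translation): We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage and often stitch together incompatible pretraining pipelines. VLA Foundry provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. We train and release two types of models: the first trained fully from scratch through the LLM-->VLM-->VLA pipeline, and the second built on the pretrained Qwen3-VL backbone. We evaluate the closed-loop policy performance of both models on LBM Eval, an open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools to make public use easier. In the nominal evaluation setting, our fully open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that outperforms the baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.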
