Fugu-MT Paper Translation (Abstract): StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Paper overview: StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

arxiv url: http://arxiv.org/abs/2604.05014v1
Date: Mon, 06 Apr 2026 17:59:21 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.408852
Title: StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
Authors: StarVLA Community
Abstract summary: Building generalist embodied agents requires integrating perception, language understanding, and action. This paper introduces StarVLA, an open-source codebase for Vision-Language-Action research.
Reference score (attention score computed independently by this site): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research. StarVLA addresses these challenges in three ways. First, it provides a modular backbone/action-head architecture that supports both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms, all under a shared abstraction in which the backbone and the action head can each be swapped independently. Second, it provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, through a unified evaluation interface that supports both simulation and real-robot deployment. StarVLA also ships simple, fully reproducible single-benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones. To the best of our knowledge, StarVLA is one of the most comprehensive open-source VLA frameworks available, and we expect it to lower the barrier to reproducing existing methods and prototyping new ones. StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at https://github.com/starVLA/starVLA.
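Illustrative sketch (not taken from the StarVLA repository): the abstract describes a shared abstraction in which the backbone and the action head can each be swapped independently. The Python below shows one way such a composition could look. Every name in it (Backbone, ActionHead, Policy, MeanPoolBackbone, MaxPoolBackbone, LinearHead, ScaledHead) is hypothetical, and the toy arithmetic merely stands in for real VLM or world-model features and real action decoding.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


class Backbone(Protocol):
    """Hypothetical interface: turn an observation and instruction into features."""

    def encode(self, image: Sequence[float], instruction: str) -> List[float]: ...


class ActionHead(Protocol):
    """Hypothetical interface: turn features into an action vector."""

    def decode(self, features: Sequence[float]) -> List[float]: ...


class MeanPoolBackbone:
    """Toy stand-in for one backbone family (assumption, not a real VLM)."""

    def encode(self, image: Sequence[float], instruction: str) -> List[float]:
        return [sum(image) / len(image), float(len(instruction.split()))]


class MaxPoolBackbone:
    """Toy stand-in for a second backbone family (assumption, not a real world model)."""

    def encode(self, image: Sequence[float], instruction: str) -> List[float]:
        return [max(image), float(len(instruction.split()))]


class LinearHead:
    """Toy action head: one weighted sum of the features per action dimension."""

    def __init__(self, weights: Sequence[Sequence[float]]) -> None:
        self.weights = weights

    def decode(self, features: Sequence[float]) -> List[float]:
        return [sum(w * f for w, f in zip(row, features)) for row in self.weights]


class ScaledHead:
    """Toy action head: multiplies every feature by a constant."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def decode(self, features: Sequence[float]) -> List[float]:
        return [self.scale * f for f in features]


@dataclass
class Policy:
    """Composes any backbone with any action head through the two interfaces."""

    backbone: Backbone
    head: ActionHead

    def act(self, image: Sequence[float], instruction: str) -> List[float]:
        return self.head.decode(self.backbone.encode(image, instruction))


image = [0.0, 0.5, 1.0]
instruction = "pick up cube"
linear = LinearHead([[1.0, 0.0], [0.0, 2.0]])

# Swap the backbone while keeping the head, then swap the head while keeping the backbone.
action_a = Policy(MeanPoolBackbone(), linear).act(image, instruction)
action_b = Policy(MaxPoolBackbone(), linear).act(image, instruction)
action_c = Policy(MeanPoolBackbone(), ScaledHead(2.0)).act(image, instruction)
```

Because Policy depends only on the two interfaces, replacing either component requires no change to the other, which is the property the abstract attributes to the shared abstraction.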

