Fugu-MT Paper Translation (Abstract): PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

Paper abstract: PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

arxiv url: http://arxiv.org/abs/2604.20834v2
Date: Fri, 24 Apr 2026 16:37:03 GMT
Status: Translation complete
Last updated in system: 2026-04-27 13:34:22.034102
Title: PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
Title (reference translation): PokeVLA: A Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
Authors: Yupeng Zheng, Xiang Li, Songen Gu, Yuhang Zheng, Shuai Tian, Weize Li, Linbo Wang, Senyu Fei, Pengfei Li, Yinfeng Gao, Zebin Xing, Yilun Chen, Qichao Zhang, Haoran Li, Wenchao Ding
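Abstract summary: PokeVLA is a lightweight model for embodied manipulation that infuses vision-language understanding into action learning. First, a compact vision-language model (PokeVLM) is pre-trained on a multimodal dataset of 2.4M samples. State-of-the-art performance is demonstrated on the LIBERO-Plus benchmark and in real-world deployment.
Reference score (independently computed attention metric): 24.102770290097435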
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA
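Abstract (reference translation): Recent Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, but existing methods have limited efficiency and lack high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. First, a compact vision-language model (PokeVLM) is pre-trained on a curated multimodal dataset of 2.4M samples covering spatial grounding, affordance, and embodied reasoning tasks; second, manipulation-relevant representations are injected into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA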

Related papers