Fugu-MT Paper Translation (Abstract): MolmoAct2: Action Reasoning Models for Real-world Deployment

Paper overview: MolmoAct2: Action Reasoning Models for Real-world Deployment

arxiv url: http://arxiv.org/abs/2605.02881v2
Date: Fri, 08 May 2026 04:21:51 GMT
Status: Translation complete
Last updated in system: 2026-05-11 16:31:22.729325
Title: MolmoAct2: Action Reasoning Models for Real-world Deployment
Title (reference translation): MolmoAct2: Action Reasoning Models for Real-world Deployment
Authors: Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli VanderBilt, Rose Hendrix, Suveen Ellawela, Lucas Ngoo, Joyce Chai, Zhongzheng Ren, Ali Farhadi, Dieter Fox, Ranjay Krishna
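Abstract summary: MolmoAct2 is a fully open action reasoning model built for practical deployment. It introduces MolmoER, a VLM backbone specialized for spatial and embodied reasoning. It releases three new datasets spanning low-to-medium cost platforms.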
Reference score (independently computed attention score): 67.6315757474802
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
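Abstract (reference translation): Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, and reasoning-augmented policies pay prohibitive latency for their grounding. We present MolmoAct2, a fully open action reasoning model built for practical deployment. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2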


Related papers