Fugu-MT Paper Translation (Abstract): RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

Paper overview: RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2603.21341v1
Date: Sun, 22 Mar 2026 17:57:55 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.366928
Title: RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models
Title (reference translation): RoboAlign: Test-Time Reasoning Learning for Language-Action Alignment in Vision-Language-Action Models
Authors: Dongyoung Kim, Sumin Park, Woomin Song, Seungku Kim, Taeyoung Kim, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo
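Abstract summary: RoboAlign is an MLLM training framework for vision-language-action models (VLAs), which translate multimodal understanding into low-level actions. The key idea is to sample action tokens via zero-shot natural language reasoning and to refine this reasoning with reinforcement learning (RL) to improve action accuracy. RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.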
Reference score (independently computed attention score): 58.83401587988675
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them that readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through vision-question-answering-style supervision. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose a more systematic MLLM training framework, RoboAlign, that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
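Abstract (reference translation): Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them, and helps translate multimodal understanding into low-level actions. Recent work has therefore explored strengthening embodied reasoning in MLLMs through visual-question-answering-style supervision. However, these approaches have been reported to yield unstable VLA performance, often producing only marginal or even negative gains. This paper proposes RoboAlign, a more systematic MLLM training framework that reliably improves VLA performance. The key idea is to sample action tokens through zero-shot natural language reasoning and to refine this reasoning with reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs and facilitates knowledge transfer from the MLLM to the VLA. To validate the effectiveness of RoboAlign, VLAs are trained by adding a diffusion-based action head on top of an MLLM backbone and evaluated on major robotics benchmarks. Notably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.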

Related papers