Fugu-MT Paper Translation (Abstract): IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction

Paper overview: IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction

arxiv url: http://arxiv.org/abs/2510.07778v1
Date: Thu, 09 Oct 2025 04:49:46 GMT
Status: Translation complete
System update date: 2025-10-10 17:54:14.867882
Title: IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction
Title (reference translation): IntentionVLA: Generalizable and Efficient Intention Reasoning for Human-Robot Interaction
Authors: Yandu Chen, Kefan Gu, Yuqing Wen, Yucheng Zhao, Tiancai Wang, Liqiang Nie
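Abstract summary: Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control. We propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. The proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning.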
Reference score (independently computed attention score): 51.130510883952546
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms $\pi_0$, achieving 18% higher success rates with direct instructions and 28% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.
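Abstract (reference translation): Vision-Language-Action (VLA) models use pretrained vision-language models (VLMs) to couple perception with robot control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are pretrained mainly on multimodal tasks that have limited relevance to embodied scenarios and are then finetuned to map explicit instructions to actions. As a result, lacking reasoning-intensive pretraining and reasoning-guided manipulation, these models cannot perform the implicit human intention reasoning that complex real-world interactions require. To overcome these limitations, we propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. The proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning. In the subsequent finetuning stage, IntentionVLA uses the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experiments show that IntentionVLA substantially outperforms $\pi_0$, with 18% higher success rates under direct instructions, and achieves 28% higher success rates than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves more than twice the success rate of all baselines and further enables zero-shot human-robot interaction with a 40% success rate. These results indicate that IntentionVLA is a promising paradigm for next-generation human-robot interaction (HRI) systems.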


Related papers