Fugu-MT Paper Translation (Abstract): RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

Paper overview: RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

arxiv url: http://arxiv.org/abs/2602.09973v1
Date: Tue, 10 Feb 2026 17:01:54 GMT
Status: translation complete
System update date: 2026-03-23 08:17:41.331846
Title: RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation
Title (reference translation): RoboInter: A Holistic Intermediate Representation Suite for Robotic Manipulation
Authors: Hao Li, Ziqin Wang, Zi-han Ding, Shuai Yang, Yilun Chen, Yang Tian, Xiaolin Hu, Tai Wang, Dahua Lin, Feng Zhao, Si Liu, Jiangmiao Pang
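Abstract summary: The RoboInter Manipulation Suite is a unified resource comprising data, benchmarks, and models of intermediate representations. It consists of RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes. RoboInter-VLA provides an integrated plan-then-execute framework that supports modular and end-to-end VLA variants.
Reference score (independently computed attention score): 104.68774434699158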
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, where high-level plans (e.g., subtasks, trace) are first generated and subsequently translated into low-level actions, but they critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource including data, benchmarks, and models of intermediate representations for manipulation. It comprises RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, which provides dense per-frame annotations over more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building upon this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework, supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. In total, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning via fine-grained and diverse intermediate representations.
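Abstract (reference translation): Progress in large vision-language models (VLMs) has driven growing interest in vision-language-action (VLA) systems for robot manipulation. Existing manipulation datasets, however, are costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, which hinders the generalization of VLA models. Recent approaches try to ease these limitations through a plan-then-execute paradigm, in which high-level plans (e.g., subtasks, traces) are generated first and then translated into low-level actions, but they depend on additional intermediate supervision that is largely missing from existing datasets. To close this gap, the authors introduce the RoboInter Manipulation Suite, a unified resource comprising data, benchmarks, and models of intermediate representations. It consists of RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, with dense per-frame annotations covering more than 10 categories of intermediate representations. On this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and improve the embodied reasoning capabilities of VLMs. RoboInter-VLA, in turn, provides an integrated plan-then-execute framework that supports modular and end-to-end VLA variants and bridges high-level planning with low-level execution through intermediate supervision. RoboInter thereby establishes a practical foundation for advancing robust and generalizable robotic learning through fine-grained and diverse intermediate representations.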
