Fugu-MT Paper Translation (Summary): Build on Priors: Vision--Language--Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation

Paper summary: Build on Priors: Vision--Language--Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation

arxiv url: http://arxiv.org/abs/2604.03759v1
Date: Sat, 04 Apr 2026 15:17:59 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:18.757015
Title: Build on Priors: Vision--Language--Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation
Title (reference translation): Vision-Language-Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation
Authors: Pierrick Lorang, Johannes Huemer, Timothy Duggan, Kai Goebel, Patrik Zips, Matthias Scheutz
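Abstract summary: This paper proposes a scalable neuro-symbolic framework that autonomously constructs symbolic planning domains and data-efficient control policies. The method segments demonstrations into skills and uses a Vision-Language Model (VLM) to classify those skills. Known controllers can be leveraged for real-world data augmentation by projecting a single demonstration onto other objects in the scene.
Reference score (independently computed attention score): 4.118262876469644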
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Enabling robots to learn long-horizon manipulation tasks from a handful of demonstrations remains a central challenge in robotics. Existing neuro-symbolic approaches often rely on hand-crafted symbolic abstractions, semantically labeled trajectories or large demonstration datasets, limiting their scalability and real-world applicability. We present a scalable neuro-symbolic framework that autonomously constructs symbolic planning domains and data-efficient control policies from as few as one to thirty unannotated skill demonstrations, without requiring manual domain engineering. Our method segments demonstrations into skills and employs a Vision-Language Model (VLM) to classify skills and identify equivalent high-level states, enabling automatic construction of a state-transition graph. This graph is processed by an Answer Set Programming solver to synthesize a PDDL planning domain, which an oracle function exploits to isolate the minimal, task-relevant and target relative observation and action spaces for each skill policy. Policies are learned at the control reference level rather than at the raw actuator signal level, yielding a smoother and less noisy learning target. Known controllers can be leveraged for real-world data augmentation by projecting a single demonstration onto other objects in the scene, simultaneously enriching the graph construction process and the dataset for imitation learning. We validate our framework primarily on a real industrial forklift across statistically rigorous manipulation trials, and demonstrate cross-platform generality on a Kinova Gen3 robotic arm across two standard benchmarks. Our results show that grounding control learning, VLM-driven abstraction, and automated planning synthesis into a unified pipeline constitutes a practical path toward scalable, data-efficient, expert-free and interpretable neuro-symbolic robotics.
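Abstract (reference translation): Having robots learn long-horizon manipulation tasks from a handful of demonstrations is a key challenge in robotics. Existing neuro-symbolic approaches often depend on hand-crafted symbolic abstractions, semantically labeled trajectories, or large demonstration datasets, which limits their scalability and real-world applicability. We propose a scalable neuro-symbolic framework that autonomously builds symbolic planning domains and data-efficient control policies from 1 to 30 unannotated skill demonstrations, with no manual domain engineering. The method segments demonstrations into skills and uses a VLM (Vision-Language Model) to classify the skills and identify equivalent high-level states, which allows the state-transition graph to be built automatically. An Answer Set Programming solver processes this graph to synthesize a PDDL planning domain, which an oracle function uses to isolate the minimal, task-relevant, target-relative observation and action spaces for each skill policy. Policies are learned at the control reference level rather than at the raw actuator signal level, giving a smoother, less noisy learning target. Known controllers can be used for real-world data augmentation by projecting a single demonstration onto other objects in the scene, which enriches both the graph construction process and the imitation learning dataset. We validate the framework primarily on a real industrial forklift across statistically rigorous manipulation trials and demonstrate cross-platform generality on a Kinova Gen3 robotic arm across two standard benchmarks. The results suggest that unifying grounded control learning, VLM-driven abstraction, and automated planning synthesis in a single pipeline is a practical path toward scalable, data-efficient, expert-free, and interpretable neuro-symbolic robotics.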
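
Illustrative sketch (not from the paper): the abstract describes building a state-transition graph from labeled skill segments and then synthesizing a PDDL planning domain from it. The Python below is a minimal sketch of that idea under strong assumptions. The predicate names, the `demo` data, and the functions `build_graph`, `induce_operators`, and `to_pddl` are all hypothetical. The labels are hard-coded rather than produced by a VLM, and a naive set-difference rule stands in for the Answer Set Programming step, so this should not be read as the authors' implementation.

```python
from collections import defaultdict


def build_graph(transitions):
    """Merge identical high-level states into nodes; edges carry skill labels."""
    graph = defaultdict(set)
    for before, skill, after in transitions:
        graph[before].add((skill, after))
        graph.setdefault(after, set())  # terminal states also appear as nodes
    return graph


def induce_operators(transitions):
    """Derive propositional STRIPS-style operators by set difference.

    This is a naive stand-in for the ASP-based synthesis, for illustration only.
    """
    ops = {}
    for before, skill, after in transitions:
        add, delete = after - before, before - after
        if skill not in ops:
            ops[skill] = {"pre": set(before), "add": set(add), "del": set(delete)}
        else:
            ops[skill]["pre"] &= before  # keep predicates true before every use
            ops[skill]["add"] |= add     # assumes effects are consistent
            ops[skill]["del"] |= delete
    return ops


def to_pddl(name, op):
    """Format one operator as a parameter-free PDDL action string."""
    pos = [f"({a})" for a in sorted(op["add"])]
    neg = [f"(not ({a}))" for a in sorted(op["del"])]
    pre = " ".join(f"({a})" for a in sorted(op["pre"]))
    eff = " ".join(pos + neg)
    return (f"(:action {name}\n"
            f"  :precondition (and {pre})\n"
            f"  :effect (and {eff}))")


# Hypothetical labeled segments: (state_before, skill, state_after),
# where each state is a frozenset of predicates that currently hold.
demo = [
    (frozenset({"hand-empty", "pallet-on-floor"}), "pick",
     frozenset({"holding-pallet"})),
    (frozenset({"holding-pallet"}), "place",
     frozenset({"hand-empty", "pallet-on-shelf"})),
]

graph = build_graph(demo)
ops = induce_operators(demo)
for name, op in ops.items():
    print(to_pddl(name, op))
```

Representing states as frozensets makes equivalent high-level states hash to the same graph node, which is the property the graph-construction step relies on. A real domain would also need typed parameters and object grounding, which this sketch omits.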

Related papers