Fugu-MT Paper Translation (Abstract): V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

Paper overview: V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

arxiv url: http://arxiv.org/abs/2604.09036v1
Date: Fri, 10 Apr 2026 06:56:17 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.734979
Title: V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation
Title (reference translation): V-CAGE: A Vision Closed-Loop Agentic Generation Engine for Robotic Manipulation
Authors: Yaru Liu, Ao-bo Wang, Nanyang Ye
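Abstract summary: This paper proposes V-CAGE, an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system. To overcome the storage bottleneck of large-scale video datasets, it implements a perceptually-driven compression algorithm.
Reference score (independently computed attention score): 6.820118518027692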
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context-awareness, making it difficult to synthesize high-fidelity environments embedded with rich semantic information, frequently resulting in unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and kinematically reachable. To ensure trajectory correctness, we integrate functional metadata with a Vision-Language-Model-based closed-loop verification mechanism, acting as a visual critic to rigorously filter out silent failures and sever the error propagation chain. Finally, to overcome the storage bottleneck of massive video datasets, we implement a perceptually-driven compression algorithm that achieves over 90% file size reduction without compromising downstream VLA training efficacy. By centralizing semantic layout planning and visual self-verification, V-CAGE automates the end-to-end pipeline, enabling the highly scalable synthesis of diverse, high-quality robotic manipulation datasets.
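Abstract (reference translation): Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context awareness, which makes it difficult to synthesize high-fidelity environments embedded with rich semantic information and often produces unreachable target positions that cause tasks to fail prematurely. This paper proposes V-CAGE (Vision-Closed-loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE functions as an embodied agentic system, using foundation models to bridge high-level semantic reasoning and low-level physical interaction. It introduces Inpainting-Guided Scene Construction to systematically arrange context-aware layouts so that generated scenes are semantically structured and kinematically reachable. To ensure trajectory correctness, it integrates functional metadata with a closed-loop verification mechanism based on a Vision-Language Model that acts as a visual critic, rigorously filtering out silent failures and severing the error propagation chain. Finally, to overcome the storage bottleneck of large-scale video datasets, it implements a perceptually-driven compression algorithm that achieves over 90% file size reduction without compromising downstream VLA training efficacy. By centralizing semantic layout planning and visual self-verification, V-CAGE automates the end-to-end pipeline and enables highly scalable synthesis of diverse, high-quality robotic manipulation datasets.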
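
The abstract describes a closed loop in which generated episodes are checked by a visual critic and silent failures are filtered out. The following is only a minimal illustrative sketch of that generate, roll out, verify, and filter pattern; it is not the authors' implementation. Every name here (`Episode`, `propose_layout`, `rollout`, `stub_critic`, `closed_loop_generate`) is hypothetical, the simulator is a deterministic stub, and the critic simply reads a success flag where a real system would query a Vision-Language Model on rendered frames.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Episode:
    scene_id: int
    attempt: int
    reached_target: bool  # known only to the stub simulator in this sketch


def propose_layout(scene_id: int, attempt: int) -> Dict[str, int]:
    # Hypothetical stand-in for layout planning: just records the indices.
    return {"scene_id": scene_id, "attempt": attempt}


def rollout(layout: Dict[str, int]) -> Episode:
    # Deterministic stub: some layouts "silently fail" to reach the target.
    ok = (layout["scene_id"] + layout["attempt"]) % 3 != 0
    return Episode(layout["scene_id"], layout["attempt"], ok)


def stub_critic(episode: Episode) -> bool:
    # Assumption: a real critic would inspect images, not a ground-truth flag.
    return episode.reached_target


def closed_loop_generate(
    propose: Callable[[int, int], Dict[str, int]],
    simulate: Callable[[Dict[str, int]], Episode],
    critic: Callable[[Episode], bool],
    n_scenes: int,
    max_retries: int = 3,
) -> List[Episode]:
    """Keep only episodes the critic accepts, retrying rejected scenes."""
    accepted: List[Episode] = []
    for scene_id in range(n_scenes):
        for attempt in range(max_retries):
            episode = simulate(propose(scene_id, attempt))
            if critic(episode):
                accepted.append(episode)
                break  # stop retrying once this scene yields a valid episode
    return accepted


if __name__ == "__main__":
    kept = closed_loop_generate(propose_layout, rollout, stub_critic, n_scenes=4)
    print(len(kept), "episodes accepted")
```

The point of the design is that rejected episodes never enter the dataset, so a failed rollout cannot propagate into downstream training data; the retry budget bounds how much compute is spent on any one scene.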
