Fugu-MT Paper Translation (Abstract): Supervise What Survives: Geometry-Guided VLA Adaptation from Synthetic Robot Videos

Paper overview: Supervise What Survives: Geometry-Guided VLA Adaptation from Synthetic Robot Videos

arxiv url: http://arxiv.org/abs/2606.24448v1
Date: Tue, 23 Jun 2026 11:35:13 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:48.917837
Title: Supervise What Survives: Geometry-Guided VLA Adaptation from Synthetic Robot Videos
Title (reference translation): Supervising What Survives: Geometry-Guided VLA Adaptation with Synthetic Robot Videos
Authors: Danze Chen, Yanzhe Chen, Qiming Huang, Zhijun Cao, Chen Gao, Mike Zheng Shou
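Abstract summary: We argue that deriving low-level control from generated visuals is a mismatched abstraction. We propose GRA, which extracts the geometric content as future 2D end-effector waypoints. On real-robot tasks, GRA outperforms pseudo-action baselines under matched data budgets.
Reference score (independently computed attention metric): 43.32573764638152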
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models require large-scale video-action pairs, yet real teleoperation remains scarce. While generated robot videos offer a scalable alternative, existing methods treat them as real robot data by recovering pseudo-actions from synthesized pixels. We argue that deriving low-level control from generated visuals is a mismatched abstraction. A video captures only geometry: the spatial trajectory representing the where of a task. A real demonstration captures control: the exact motor commands representing the how. Human-to-robot video generation preserves these unequally: the visible geometry survives the generation process, while the underlying control signals are lost. This Asymmetric Preservation Principle dictates a clean rule: this surviving geometry should solely supervise visual perception, leaving control to real demonstrations. Following this principle, we propose GRA (Geometry-guided Representation Alignment), which extracts the geometric content as future 2D end-effector waypoints, computed from the source human video through pose estimation, retargeting, simulation, and calibrated projection, and routes them to the VLA vision backbone via an auxiliary 2D head. The action head is trained on real demonstrations only. During fine-tuning, the waypoint loss persists as a spatial representation anchor that prevents the backbone from losing its geometric grounding. On real-robot tasks, GRA outperforms pseudo-action baselines under matched data budgets and narrows the gap to policies trained with substantially more real demonstrations, suggesting that correctly routed geometry bridges generated videos to robot policies more reliably than recovered actions.
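Abstract (reference translation): Vision-Language-Action (VLA) models require large-scale video-action pairs, yet real teleoperation is scarce. Generated robot videos offer a scalable alternative, but existing methods treat them as real robot data by recovering pseudo-actions from synthesized pixels. We argue that deriving low-level control from generated visuals is a mismatched abstraction. A video captures only geometry, the spatial trajectory representing the where of a task. A real demonstration captures control: the exact motor commands representing the how. The visible geometry survives the generation process, while the underlying control signals are lost. This Asymmetric Preservation Principle dictates a clean rule. Following this principle, GRA extracts the geometric content as future 2D end-effector waypoints, computes them from the source video through pose estimation, retargeting, simulation, and calibration, and routes them to the VLA vision backbone via an auxiliary 2D head. The action head is trained on real demonstrations only. During fine-tuning, the waypoint loss persists as a spatial representation anchor that prevents the backbone from losing its geometric grounding. On real-robot tasks, GRA outperforms pseudo-action baselines under matched data budgets and narrows the gap to policies trained with real demonstrations.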
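
Illustrative sketch (not the authors' implementation): the abstract describes two mechanisms that a short example can make concrete, namely the calibrated projection that turns 3D end-effector positions into 2D waypoints, and the routing rule that sends waypoint supervision to every sample while the action loss sees real demonstrations only. The code below is a minimal pure-Python sketch under stated assumptions: a pinhole camera with intrinsics `fx, fy, cx, cy`, mean squared error for both losses, a scalar `wp_weight`, and a per-sample `is_real` flag. None of these names or choices come from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def project_waypoints(
    points_cam: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Tuple[float, float]]:
    """Project 3D points in the camera frame to 2D pixels (pinhole model).

    Assumption: points are already expressed in camera coordinates and
    lens distortion is ignored.
    """
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:
            raise ValueError("point must lie in front of the camera (z > 0)")
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long flat vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def routed_loss(samples: List[Dict], wp_weight: float = 1.0) -> Tuple[float, float, float]:
    """Return (total, waypoint_loss, action_loss) for a batch.

    Waypoint loss is averaged over every sample (assumed here for both
    generated and real data); action loss is averaged over samples whose
    hypothetical `is_real` flag is True, and is 0.0 if there are none.
    """
    wp_loss = sum(mse(s["pred_wp"], s["target_wp"]) for s in samples) / len(samples)
    real = [s for s in samples if s["is_real"]]
    if real:
        act_loss = sum(mse(s["pred_act"], s["target_act"]) for s in real) / len(real)
    else:
        act_loss = 0.0
    return act_loss + wp_weight * wp_loss, wp_loss, act_loss
```

In this sketch, a generated-video sample with `is_real=False` contributes nothing to the action term no matter what its action fields contain, which mirrors the stated rule that control is learned from real demonstrations only.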

Related papers