Fugu-MT paper translation (abstract): From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

Paper overview: From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

arxiv url: http://arxiv.org/abs/2604.04974v1
Date: Sat, 04 Apr 2026 15:37:11 GMT
Status: translation complete
Last updated in system: 2026-04-08 17:42:09.372868
Title: From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
Authors: Linfang Zheng, Zikai Ouyang, Chen Wang, Jia Pan, Wei Zhang
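Abstract summary: Video captures how objects move, how contact unfolds, and how scenes evolve during interaction. However, video lacks action supervision and differs from robot experience in embodiment, viewpoint, and physical constraints. This survey reviews methods that use non-action-annotated temporal video to learn control interfaces for robotic manipulation.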
Reference score (independently computed attention score): 17.579758359658218
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video is a scalable observation of physical dynamics: it captures how objects move, how contact unfolds, and how scenes evolve under interaction, all without requiring robot action labels. Yet translating this temporal structure into reliable robotic control remains an open challenge, because video lacks action supervision and differs from robot experience in embodiment, viewpoint, and physical constraints. This survey reviews methods that exploit non-action-annotated temporal video to learn control interfaces for robotic manipulation. We introduce an interface-centric taxonomy organized by where the video-to-control interface is constructed and what control properties it enables, identifying three families: direct video-action policies, which keep the interface implicit; latent-action methods, which route temporal structure through a compact learned intermediate; and explicit visual interfaces, which predict interpretable targets for downstream control. For each family, we analyze control-integration properties: how the loop is closed, what can be verified before execution, and where failures enter. A cross-family synthesis reveals that the most pressing open challenges center on the robotics integration layer (the mechanisms that connect video-derived predictions to dependable robot behavior), and we outline research directions toward closing this gap.

Related papers