Fugu-MT Paper Translation (Abstract): Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models

Paper overview: Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models

arxiv url: http://arxiv.org/abs/2605.09670v1
Date: Sun, 10 May 2026 17:36:22 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.362161
Title: Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models
Authors: Aws Khalil, Jaerock Kwon
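Abstract summary: This paper presents a benchmark of off-the-shelf generative video models for short-horizon predictive display. Performance is assessed using prediction accuracy, per-rollout latency, peak GPU memory usage, and temporal error evolution. The findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation.
Reference score (attention score computed independently by this site): 0.0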
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: https://bimilab.github.io/paper-GenPD
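The abstract names mean absolute difference, per-step error across the prediction horizon, and real-time inference at the source frame rate as evaluation criteria. The following is a minimal sketch of how such quantities could be computed, not the authors' released pipeline. The function names, the nested-list frame representation, and the real-time criterion (rollout latency no longer than the video time the rollout covers) are all assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical frame representation: rows of grayscale pixel intensities.
Frame = Sequence[Sequence[float]]


def frame_mad(pred: Frame, target: Frame) -> float:
    """Mean absolute difference between two frames of identical shape."""
    if len(pred) != len(target):
        raise ValueError("frames differ in height")
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("frame rows differ in width")
        for p, t in zip(pred_row, target_row):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("empty frame")
    return total / count


def per_step_mad(preds: Sequence[Frame], targets: Sequence[Frame]) -> List[float]:
    """MAD at each step of the prediction horizon (temporal error evolution)."""
    if len(preds) != len(targets):
        raise ValueError("rollout and ground truth differ in length")
    return [frame_mad(p, t) for p, t in zip(preds, targets)]


def rollout_mad(preds: Sequence[Frame], targets: Sequence[Frame]) -> float:
    """Average of the per-step MAD values over the whole rollout."""
    steps = per_step_mad(preds, targets)
    if not steps:
        raise ValueError("empty rollout")
    return sum(steps) / len(steps)


def is_real_time(latency_s: float, horizon_frames: int, source_fps: float) -> bool:
    """Assumed criterion: generating the rollout takes no longer than
    the duration of source video that the rollout covers."""
    return latency_s <= horizon_frames / source_fps


# Two-step rollout on 2x2 frames with made-up pixel values.
preds = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
targets = [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 3.0], [1.0, 3.0]]]
print(per_step_mad(preds, targets))  # error at each horizon step
print(rollout_mad(preds, targets))   # single rollout-level score
```

A per-step list that keeps growing with the horizon index would correspond to the divergent error behavior the abstract mentions, while the rollout-level average hides that trend, which is one reason to report both.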
