Fugu-MT Paper Translation (Abstract): ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

Paper overview: ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

arxiv url: http://arxiv.org/abs/2510.04290v1
Date: Sun, 05 Oct 2025 17:02:01 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.564364
Title: ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation
Title (reference translation): ChronoEdit: Toward Temporal Reasoning for Image Editing and World Simulation
Authors: Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling
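Abstract summary: This paper introduces ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time.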
Reference score (independently computed attention score): 74.33442027081651
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large generative models have significantly advanced image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page: https://research.nvidia.com/labs/toronto-ai/chronoedit
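Abstract (reference translation): Recent advances in large generative models have markedly improved image editing and in-context image generation, but a critical gap remains in ensuring physical consistency, where edited objects must stay coherent. This capability is especially important for tasks related to world simulation. This paper introduces ChronoEdit, a framework that recasts image editing as a video generation problem. First, ChronoEdit treats the input image and the edited image as the first and last frames of a video, which lets it leverage large pretrained video generation models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. In this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that restricts the solution space to physically viable transformations. The reasoning tokens are dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, the authors introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and show that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page.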
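
The two-phase inference schedule described in the abstract (joint denoising of the target frame with reasoning tokens, then dropping those tokens after a few steps) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoiser`, `edit_with_temporal_reasoning`, the token counts, and the step counts are all hypothetical names and values chosen for illustration, and the "denoiser" is a stand-in that simply shrinks values toward zero.

```python
import random


def toy_denoiser(tokens, step, total_steps):
    """Hypothetical stand-in for a video diffusion model.

    It only shrinks every value toward zero; a real model would attend
    across all tokens in the sequence.
    """
    scale = 1.0 - 1.0 / (total_steps - step + 1)
    return [[v * scale for v in tok] for tok in tokens]


def edit_with_temporal_reasoning(input_frame, num_reasoning_tokens=3,
                                 total_steps=10, reasoning_steps=4, seed=0):
    """Toy schedule: denoise the target jointly with reasoning tokens for
    the first `reasoning_steps` steps, then drop them and continue with
    only the input frame and the target frame.

    All parameter names and defaults are assumptions for illustration.
    """
    rng = random.Random(seed)
    dim = len(input_frame)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    reasoning = [noise() for _ in range(num_reasoning_tokens)]  # intermediate frames
    target = noise()                                            # last frame (the edit)
    sequence_lengths = []

    for step in range(total_steps):
        if step == reasoning_steps:
            reasoning = []  # drop reasoning tokens to save compute
        sequence = [input_frame] + reasoning + [target]
        sequence_lengths.append(len(sequence))
        denoised = toy_denoiser(sequence, step, total_steps)
        # The input frame is conditioning only, so its output is discarded.
        reasoning = denoised[1:-1]
        target = denoised[-1]

    return target, sequence_lengths


if __name__ == "__main__":
    frame = [0.5, -0.2, 0.1, 0.9]
    edited, lengths = edit_with_temporal_reasoning(frame)
    print(lengths)
```

The recorded sequence lengths show where the cost saving comes from in this sketch: only the early steps process the longer sequence that includes the reasoning tokens, while the remaining steps process just the two frames.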

Related papers