Fugu-MT Paper Translation (Abstract): Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion

Paper abstract: Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion

arxiv url: http://arxiv.org/abs/2604.01761v1
Date: Thu, 02 Apr 2026 08:27:48 GMT
Status: Translation complete
System update date: 2026-04-03 14:21:10.612449
Title: Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
Title (reference translation): Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
Authors: Edoardo A. Dominici, Thomas Deixelberger, Konstantinos Vardis, Markus Steinberger
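Abstract summary: We introduce a lightweight architecture and training strategy that decouples appearance from other features. We show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.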
Reference score (independently computed attention score): 4.4853338999399375
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing and for aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, such as DINO, contain a lot of entangled information about the style, lighting, and semantics of the scene. This makes them great at reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use these features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.
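Abstract (reference translation): Video models have recently been applied successfully to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer depend on conditioning these models, typically through perceptual, geometric, or simple semantic signals, essentially using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing and for aligning and training video diffusion models, but not as a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, such as DINO, contain a great deal of entangled information about the style, lighting, and semantics of a scene. This makes them excellent for reconstruction tasks but limits their generative capabilities. In this paper, we show how these features can be used for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from the other features we wish to preserve, enabling robust control over appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.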

Related papers