Fugu-MT Paper Translation (Abstract): EVCtrl: Efficient Control Adapter for Visual Generation

Paper overview: EVCtrl: Efficient Control Adapter for Visual Generation

arxiv url: http://arxiv.org/abs/2508.10963v1
Date: Thu, 14 Aug 2025 14:11:48 GMT
Status: Translation complete
Last updated in system: 2025-08-18 14:51:23.615684
Title: EVCtrl: Efficient Control Adapter for Visual Generation
Title (reference translation): EVCtrl: An Efficient Control Adapter for Visual Generation
Authors: Zixiang Yang, Yue Ma, Yinhan Zhang, Shanhui Mo, Dongrui Liu, Linfeng Zhang
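Abstract summary: We introduce EVCtrl, a lightweight, plug-and-play control adapter that reduces overhead without retraining the model. Experiments on CogVideo-Controlnet, Wan2.1-Controlnet, and Flux demonstrate that the method is effective for controllable image and video generation without any training.
Reference score (independently computed attention score): 9.62167187199932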
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual generation includes both image and video generation, training probabilistic models to create coherent, diverse, and semantically faithful content from scratch. While early research focused on unconditional sampling, practitioners now demand controllable generation that allows precise specification of layout, pose, motion, or style. Although ControlNet grants precise spatial-temporal control, its auxiliary branch markedly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that slashes overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching strategy for sparse control information. For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained control, then partition the network into global and local functional zones. A locality-aware cache focuses computation on the local zones that truly need the control signal, skipping the bulk of redundant computation in global regions. For temporal redundancy, we selectively omit unnecessary denoising steps to improve efficiency. Extensive experiments on CogVideo-Controlnet, Wan2.1-Controlnet, and Flux demonstrate that our method is effective for controllable image and video generation without the need for training. For example, it achieves 2.16 and 2.05 times speedups on CogVideo-Controlnet and Wan2.1-Controlnet, respectively, with almost no degradation in generation quality. Code is available in the supplementary materials.
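Abstract (reference translation): Visual generation covers both image generation and video generation, and trains probabilistic models to create coherent, diverse, and semantically faithful content from scratch. Early research focused on unconditional sampling, but controllable generation that allows precise specification of layout, pose, motion, and style is now in demand. ControlNet allows precise spatio-temporal control, but its auxiliary branch significantly increases latency and introduces redundant computation both in uncontrolled regions and in denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that reduces overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching scheme for sparse control information. For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained control, and then partition the network into global and local functional zones. A locality-aware cache focuses computation on the local zones that truly need the control signal and skips most of the redundant computation in global regions. For temporal redundancy, unnecessary denoising steps are selectively omitted to improve efficiency. Extensive experiments on CogVideo-Controlnet, Wan2.1-Controlnet, and Flux show that the method is effective for controllable image and video generation without requiring training. For example, it achieves speedups of 2.16 times and 2.05 times on CogVideo-Controlnet and Wan2.1-Controlnet, respectively, with almost no degradation in generation quality.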
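
Illustrative sketch (not from the paper): the abstract describes a spatio-temporal dual cache but gives no implementation details here. The toy Python below only illustrates the general idea under stated assumptions: a per-token control computation is cached, the cached output is reused on skipped denoising steps (temporal), and on refresh steps only tokens flagged by a mask are recomputed (spatial). The class name DualCacheControl, the refresh_every schedule, the token mask, and the control function are all hypothetical and are not the authors' method or API.

```python
from typing import Callable, List, Optional


class DualCacheControl:
    """Toy spatio-temporal cache around a per-token control function.

    Not the EVCtrl implementation: the refresh schedule and the token
    mask are illustrative assumptions.
    """

    def __init__(self, control_fn: Callable[[float, int], float], refresh_every: int = 2):
        self.control_fn = control_fn        # hypothetical per-token control computation
        self.refresh_every = refresh_every  # hypothetical temporal schedule
        self.cache: Optional[List[float]] = None
        self.evaluations = 0                # per-token computations actually performed

    def _eval(self, token: float, step: int) -> float:
        self.evaluations += 1
        return self.control_fn(token, step)

    def __call__(self, tokens: List[float], local_mask: List[bool], step: int) -> List[float]:
        if self.cache is None:
            # First call: nothing is cached yet, so compute every token.
            self.cache = [self._eval(t, step) for t in tokens]
        elif step % self.refresh_every == 0:
            # Refresh step: recompute only tokens flagged as needing control.
            self.cache = [
                self._eval(t, step) if needs_control else cached
                for t, needs_control, cached in zip(tokens, local_mask, self.cache)
            ]
        # Any other step is skipped: the cached output is reused unchanged.
        return list(self.cache)


def run_demo():
    tokens = [1.0, 2.0, 3.0, 4.0]
    local_mask = [True, False, False, True]  # assumed: only the edge tokens carry control
    control = DualCacheControl(lambda t, s: t * 10 + s, refresh_every=2)
    outputs = [control(tokens, local_mask, step) for step in range(4)]
    return outputs, control.evaluations
```

In this toy run, four steps over four tokens cost 6 per-token evaluations instead of the 16 an uncached loop would need. That count is a property of the toy schedule only and says nothing about the speedups reported in the abstract.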

List of related papers