Fugu-MT Paper Translation (Abstract): NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer

Paper abstract: NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer

arxiv url: http://arxiv.org/abs/2508.10424v1
Date: Thu, 14 Aug 2025 07:54:44 GMT
Status: Translation complete
Last updated in system: 2025-08-15 22:24:48.219878
Title: NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer
Title (reference translation): NanoControl: A Lightweight Framework for High-Precision, Efficient Control of Diffusion Transformers
Authors: Shanyuan Liu, Jian Zhu, Junda Lu, Yue Gong, Liuzhuozheng Li, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng, Yuhui Yin
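Abstract summary: NanoControl uses Flux as the backbone network for controllable text-to-image generation. The model achieves state-of-the-art controllable text-to-image generation performance. It increases the parameter count by only 0.024% and GFLOPs by only 0.029%, enabling highly efficient controllable generation.
Reference score (independently computed attention metric): 14.644014499085943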
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm originally designed for UNet-based diffusion models. This paradigm introduces significant parameter overhead and increased computational costs. To address these challenges, we propose the Nano Control Diffusion Transformer (NanoControl), which employs Flux as the backbone network. Our model achieves state-of-the-art controllable text-to-image generation performance while incurring only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, thus enabling highly efficient controllable generation. Specifically, rather than duplicating the DiT backbone for control, we design a LoRA-style (low-rank adaptation) control module that directly learns control signals from raw conditioning inputs. Furthermore, we introduce a KV-Context Augmentation mechanism that integrates condition-specific key-value information into the backbone in a simple yet highly effective manner, facilitating deep fusion of conditional features. Extensive benchmark experiments demonstrate that NanoControl significantly reduces computational overhead compared to conventional control approaches, while maintaining superior generation quality and achieving improved controllability.
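Abstract (reference translation): Diffusion Transformers (DiTs) have shown exceptional capability in text-to-image synthesis. However, in controllable text-to-image generation with DiTs, most existing methods still rely on the ControlNet paradigm, which was originally designed for UNet-based diffusion models. This paradigm brings significant parameter overhead and increased computational cost. To address these challenges, we propose the Nano Control Diffusion Transformer (NanoControl), which uses Flux as its backbone network. Our model achieves state-of-the-art controllable text-to-image generation performance with only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, enabling highly efficient controllable generation. Specifically, rather than duplicating the DiT backbone for control, we design a LoRA-style control module that learns control signals directly from raw conditioning inputs. In addition, we introduce a KV-Context Augmentation mechanism that incorporates condition-specific key-value information into the backbone, facilitating deep fusion of conditional features. Extensive benchmark experiments show that NanoControl substantially reduces computational overhead compared with conventional control approaches while maintaining superior generation quality and improving controllability.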
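
The abstract names two ideas, a LoRA-style control module and KV-Context Augmentation, without giving implementation details. The following is a minimal, hypothetical sketch of how such a scheme could look, not the authors' implementation. It assumes the condition is already available as a sequence of token vectors, that low-rank (down, up) matrices project those tokens into extra keys and values, and that these are concatenated to the backbone's keys and values along the token axis. All function and variable names (`low_rank_project`, `kv_augmented_attention`, `k_down`, `k_up`, and so on) are illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Plain nested-list matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights or activations."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def low_rank_project(x, down, up):
    """LoRA-style projection x @ down @ up; the rank is len(up)."""
    return matmul(matmul(x, down), up)


def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def kv_augmented_attention(q, k, v, cond, k_down, k_up, v_down, v_up):
    """Assumed form of KV-context augmentation: append condition-derived
    keys/values to the backbone's keys/values; queries are left unchanged."""
    k_cond = low_rank_project(cond, k_down, k_up)
    v_cond = low_rank_project(cond, v_down, v_up)
    # List concatenation extends the token axis: n_tokens + n_cond keys/values.
    return attention(q, k + k_cond, v + v_cond)


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
dim, rank, n_tokens, n_cond = 8, 2, 4, 3
q = rand_matrix(n_tokens, dim, rng)
k = rand_matrix(n_tokens, dim, rng)
v = rand_matrix(n_tokens, dim, rng)
cond = rand_matrix(n_cond, dim, rng)
k_down, k_up = rand_matrix(dim, rank, rng), rand_matrix(rank, dim, rng)
v_down, v_up = rand_matrix(dim, rank, rng), rand_matrix(rank, dim, rng)

out, weights = kv_augmented_attention(q, k, v, cond, k_down, k_up, v_down, v_up)

# Extra parameters for the two low-rank projections vs. two full dim x dim ones.
extra_params = 2 * (dim * rank + rank * dim)
full_params = 2 * dim * dim
```

In this sketch the output keeps the backbone's shape (`n_tokens` by `dim`), while each query attends over `n_tokens + n_cond` positions. The low-rank factorization is what keeps the added parameter count small relative to full projections, which is consistent with, though not a reproduction of, the small overhead figures reported in the abstract.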

Related papers