Fugu-MT Paper Translation (Abstract): V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Paper summary: V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

arxiv url: http://arxiv.org/abs/2603.16792v1
Date: Tue, 17 Mar 2026 17:01:54 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.439003
Title: V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising
Title (reference translation): V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising
Authors: Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal
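Abstract summary: We present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. The study reveals four key ingredients for effective visual co-denoising. V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods.
Reference score (independently computed attention score): 65.5867130156805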
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
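Abstract (reference translation): Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, so it is unclear which of them are truly essential. We therefore present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting makes it possible to isolate the ingredients that make visual co-denoising effective. The study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which is realized through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.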
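
The abstract names RMS-based feature rescaling and classifier-free guidance (CFG) but gives no formulas. The sketch below is only a rough illustration of what those two terms commonly mean, not V-Co's implementation. It assumes a per-token RMS normalization toward a hypothetical `target_rms` value and uses the standard CFG extrapolation rule. The function names, parameters, and array shapes are all assumptions, and the "structurally defined unconditional prediction" from the abstract is not modeled here.

```python
import numpy as np


def rms(x, eps=1e-6):
    """Root-mean-square over the last axis, kept broadcastable."""
    return np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)


def rescale_to_rms(features, target_rms=1.0, eps=1e-6):
    """Scale each token so its RMS is approximately `target_rms`.

    Assumption: calibration is done per token (last axis = channels).
    The paper may calibrate differently; the abstract does not say.
    """
    return features * (target_rms / rms(features, eps=eps))


def cfg_combine(pred_cond, pred_uncond, guidance_scale):
    """Standard CFG: extrapolate from the unconditional prediction
    toward the conditional one by `guidance_scale`."""
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical semantic features: 4 tokens, 8 channels, large scale.
    feats = rng.normal(scale=5.0, size=(4, 8))
    calibrated = rescale_to_rms(feats, target_rms=1.0)
    print("per-token RMS after rescaling:", rms(calibrated).ravel())

    cond = np.array([2.0, 0.0])
    uncond = np.array([1.0, 0.0])
    print("guided prediction:", cfg_combine(cond, uncond, guidance_scale=3.0))
```

With `guidance_scale=1.0` the combination reduces to the conditional prediction, and with `0.0` to the unconditional one. Since the guided result is an extrapolation from the unconditional prediction, it depends on how that prediction is defined.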

Related papers