U-REPA: Aligning Diffusion U-Nets to ViTs
- URL: http://arxiv.org/abs/2503.18414v1
- Date: Mon, 24 Mar 2025 07:46:00 GMT
- Title: U-REPA: Aligning Diffusion U-Nets to ViTs
- Authors: Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, Yunhe Wang,
- Abstract summary: We propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features.<n>Experiments indicate that the resulting U-REPA could achieve excellent generation quality and greatly accelerates the convergence speed.
- Score: 29.987838381433445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Representation Alignment (REPA) that aligns Diffusion Transformer (DiT) hidden-states with ViT visual encoders has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture that shows faster convergence compared to DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To encounter these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: Firstly, we propose via observation that due to skip connection, the middle stage of U-Net is the best alignment option. Secondly, we propose upsampling of U-Net features after passing them through MLPs. Thirdly, we observe difficulty when performing tokenwise similarity alignment, and further introduces a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA could achieve excellent generation quality and greatly accelerates the convergence speed. With CFG guidance interval, U-REPA could reach $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to perform better than REPA. Codes are available at https://github.com/YuchuanTian/U-REPA.
Related papers
- CoIFNet: A Unified Framework for Multivariate Time Series Forecasting with Missing Values [17.25081407284703]
Collaborative Imputation-Forecasting Network (CoIFNet) is a novel framework that unifies imputation and forecasting.<n>CoIFNet takes the observed values, mask matrix and timestamp embeddings as input, processing them sequentially.<n>We demonstrate the effectiveness and computational efficiency of our proposed approach across diverse missing-data scenarios.
arXiv Detail & Related papers (2025-06-16T03:15:12Z) - RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision [7.721101317599364]
We propose a hierarchical dense positive supervision method based on RT-DETR, named RT-DETRv3.
To address insufficient decoder training, we propose a novel learning strategy involving self-attention perturbation.
RT-DETRv3 significantly outperforms existing real-time detectors, including the RT-DETR series and the YOLO series.
arXiv Detail & Related papers (2024-09-13T02:02:07Z) - U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers [28.936553798624136]
We propose a series of U-shaped DiTs (U-DiTs) in the paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models.
The proposed U-DiT could outperform DiT-XL/2 with only 1/6 of its cost computation.
arXiv Detail & Related papers (2024-05-04T18:27:29Z) - TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z) - AANet: Aggregation and Alignment Network with Semi-hard Positive Sample
Mining for Hierarchical Place Recognition [48.043749855085025]
Visual place recognition (VPR) is one of the research hotspots in robotics, which uses visual information to locate robots.
We present a unified network capable of extracting global features for retrieving candidates via an aggregation module.
We also propose a Semi-hard Positive Sample Mining (ShPSM) strategy to select appropriate hard positive images for training more robust VPR networks.
arXiv Detail & Related papers (2023-10-08T14:46:11Z) - SFNet: Faster and Accurate Semantic Segmentation via Semantic Flow [88.97790684009979]
A common practice to improve the performance is to attain high-resolution feature maps with strong semantic representation.
We propose a Flow Alignment Module (FAM) to learn textitSemantic Flow between feature maps of adjacent levels.
We also present a novel Gated Dual Flow Alignment Module to directly align high-resolution feature maps and low-resolution feature maps.
arXiv Detail & Related papers (2022-07-10T08:25:47Z) - Accelerating DETR Convergence via Semantic-Aligned Matching [50.3633635846255]
This paper presents SAM-DETR, a Semantic-Aligned-Matching DETR that greatly accelerates DETR's convergence without sacrificing its accuracy.
It explicitly searches salient points with the most discriminative features for semantic-aligned matching, which further speeds up the convergence and boosts detection accuracy as well.
arXiv Detail & Related papers (2022-03-14T06:50:51Z) - Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain
Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT saturates quickly with depth increasing, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - RCNet: Reverse Feature Pyramid and Cross-scale Shift Network for Object
Detection [10.847953426161924]
We propose RCNet, which consists of Reverse Feature Pyramid (RevFP) and Cross-scale Shift Network (CSN)
RevFP utilizes local bidirectional feature fusion to simplify the bidirectional pyramid inference pipeline.
CSN directly propagates representations to both adjacent and non-adjacent levels to enable multi-scale features more correlative.
arXiv Detail & Related papers (2021-10-23T04:00:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.