What matters for Representation Alignment: Global Information or Spatial Structure?
- URL: http://arxiv.org/abs/2512.10794v1
- Date: Thu, 11 Dec 2025 16:39:53 GMT
- Title: What matters for Representation Alignment: Global Information or Spatial Structure?
- Authors: Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, Saining Xie,
- Abstract summary: Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features.<n>We investigate a fundamental question: what aspect of the target representation matters for generation, its textitglobal revisionsemantic information.<n>We replace the standard projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation.
- Score: 64.67092609921816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its \textit{global} \revision{semantic} information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e. pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of \emph{spatial} information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in $<$4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT etc). %, etc. Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
Related papers
- Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages [3.294155819837931]
transformer models compress information from all tokens in a sequence into a single [/] token to represent global context.<n>This approach risks diluting fine-grained or hierarchical features, leading to information loss in downstream tasks where local patterns are important.<n>We propose an inception-style 1-D convolution module that sits on top of the transformer layer and augments token representations with multi-scale local features.
arXiv Detail & Related papers (2025-05-26T19:59:22Z) - Mapping representations in Reinforcement Learning via Semantic Alignment for Zero-Shot Stitching [17.76990521486307]
Deep Reinforcement Learning models often fail to generalize when even small changes occur in the environment's observations or task requirements.<n>We propose a zero-shot method for mapping between latent spaces across different agents trained on different visual and task variations.<n>We empirically demonstrate zero-shot stitching performance on the CarRacing environment with changing background and task.
arXiv Detail & Related papers (2025-02-26T22:06:00Z) - How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks [14.338754598043968]
Two competing paradigms exist for self-supervised learning of data representations.
Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other.
arXiv Detail & Related papers (2024-07-03T19:43:12Z) - Masked Completion via Structured Diffusion with White-Box Transformers [23.07048591213815]
We provide the first instantiation of the white-box design paradigm that can be applied to large-scale unsupervised representation learning.
We do this by exploiting a fundamental connection between diffusion, compression, and (masked) completion, deriving a deep transformer-like masked autoencoder architecture.
CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets.
arXiv Detail & Related papers (2024-04-03T04:23:01Z) - Exploring and Exploiting Multi-Granularity Representations for Machine
Reading Comprehension [13.191437539419681]
We propose a novel approach called Adaptive Bidirectional Attention-Capsule Network (ABA-Net)
ABA-Net adaptively exploits the source representations of different levels to the predictor.
We set the new state-of-the-art performance on the SQuAD 1.0 dataset.
arXiv Detail & Related papers (2022-08-18T10:14:32Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - FedAvg with Fine Tuning: Local Updates Lead to Representation Learning [54.65133770989836]
Federated Averaging (FedAvg) algorithm consists of alternating between a few local gradient updates at client nodes, followed by a model averaging update at the server.
We show that the reason behind generalizability of the FedAvg's output is its power in learning the common data representation among the clients' tasks.
We also provide empirical evidence demonstrating FedAvg's representation learning ability in federated image classification with heterogeneous data.
arXiv Detail & Related papers (2022-05-27T00:55:24Z) - Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, vision transformer, into salient object detection.
With the global view in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z) - HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantages of both CNNs and Transformers for image-based person Re-ID with high performance.
Work is the first to take advantages of both CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.