From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
- URL: http://arxiv.org/abs/2508.15404v1
- Date: Thu, 21 Aug 2025 09:52:12 GMT
- Title: From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
- Authors: Anthony Bisulco, Rahul Ramesh, Randall Balestriero, Pratik Chaudhari,
- Abstract summary: Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models.<n>This work investigates how MAEs learn spatial correlations in the input image.<n>We analytically derive the features learned by a linear MAE and show that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations.
- Score: 33.47980409127128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning (masking ratio, patch size, encoder/decoder layers) when applied to novel datasets. While prior theoretical works have analyzed MAEs in terms of their attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and performance on downstream tasks is relatively unexplored. This work investigates how MAEs learn spatial correlations in the input image. We analytically derive the features learned by a linear MAE and show that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations. We extend this analysis to non-linear MAEs to show that MAE representations adapt to spatial correlations in the dataset, beyond second-order statistics. Finally, we discuss some insights on how to select MAE hyper-parameters in practice.
Related papers
- Meta-Learning Linear Models for Molecular Property Prediction [3.9685594339912633]
We introduce LAMeL - a Linear Algorithm for Meta-Learning that preserves interpretability while improving the prediction accuracy across multiple properties.<n>Our method delivers performance improvements ranging from 1.1- to 25-fold over standard ridge regression, depending on the domain of the dataset.
arXiv Detail & Related papers (2025-09-16T20:41:45Z) - Enhancing DNA Foundation Models to Address Masking Inefficiencies [18.54660252939211]
We propose a modified encoder-decoder architecture based on the masked autoencoder framework.<n>We evaluate our approach on the BIOSCAN-5M dataset, comprising over 2 million unique DNA barcodes.
arXiv Detail & Related papers (2025-02-25T17:56:25Z) - Exploring Patterns Behind Sports [3.2838877620203935]
This paper presents a comprehensive framework for time series prediction using a hybrid model that combines ARIMA and LSTM.<n>The model incorporates feature engineering techniques, including embedding and PCA, to transform raw data into a lower-dimensional representation.
arXiv Detail & Related papers (2025-02-11T11:51:07Z) - Enabling Autoregressive Models to Fill In Masked Tokens [50.9948753314669]
This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that achieves state-of-the-art masked infilling performance.<n>MARIA combines a pre-trained and AR model by training a linear decoder that takes their hidden states as input.<n>Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
arXiv Detail & Related papers (2025-02-09T20:02:05Z) - Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend the SAM to Few-shot Semantic segmentation (FSS)
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement
Learning [53.00683059396803]
Mask image model (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z) - Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget [10.290956481715387]
Masked Autoencoder Contrastive Tuning (MAE-CT) is a sequential approach that tunes the rich features such that they form semantic clusters of objects without using any labels.
MaE-CT does not rely on hand-crafted augmentations and frequently achieves its best performances while using only minimal augmentations (crop & flip)
MaE-CT excels over previous self-supervised methods trained on ImageNet in linear probing, k-NN and low-shot classification accuracy as well as in unsupervised clustering accuracy.
arXiv Detail & Related papers (2023-04-20T17:51:09Z) - Optimizations of Autoencoders for Analysis and Classification of
Microscopic In Situ Hybridization Images [68.8204255655161]
We propose a deep-learning framework to detect and classify areas of microscopic images with similar levels of gene expression.
The data we analyze requires an unsupervised learning model for which we employ a type of Artificial Neural Network - Deep Learning Autoencoders.
arXiv Detail & Related papers (2023-04-19T13:45:28Z) - EnfoMax: Domain Entropy and Mutual Information Maximization for Domain
Generalized Face Anti-spoofing [0.0]
Face anti-spoofing (FAS) method performs well under intra-domain setups.
The domain generalization (DG) method has gained more attention in FAS.
This paper proposes the EnfoMax framework, which uses information theory to analyze cross-domain FAS tasks.
arXiv Detail & Related papers (2023-02-17T03:54:18Z) - Shortcut Detection with Variational Autoencoders [1.3174512123890016]
We present a novel approach to detect shortcuts in image and audio datasets by leveraging variational autoencoders (VAEs)
The disentanglement of features in the latent space of VAEs allows us to discover feature-target correlations in datasets and semi-automatically evaluate them for ML shortcuts.
We demonstrate the applicability of our method on several real-world datasets and identify shortcuts that have not been discovered before.
arXiv Detail & Related papers (2023-02-08T18:26:10Z) - OST: Efficient One-stream Network for 3D Single Object Tracking in Point Clouds [6.661881950861012]
We propose a novel one-stream network with the strength of the instance-level encoding, which avoids the correlation operations occurring in previous Siamese network.
The proposed method has achieved considerable performance not only for class-specific tracking but also for class-agnostic tracking with less computation and higher efficiency.
arXiv Detail & Related papers (2022-10-16T12:31:59Z) - Dynamically-Scaled Deep Canonical Correlation Analysis [77.34726150561087]
Canonical Correlation Analysis (CCA) is a method for feature extraction of two views by finding maximally correlated linear projections of them.
We introduce a novel dynamic scaling method for training an input-dependent canonical correlation model.
arXiv Detail & Related papers (2022-03-23T12:52:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.