LayerLock: Non-collapsing Representation Learning with Progressive Freezing
- URL: http://arxiv.org/abs/2509.10156v3
- Date: Tue, 30 Sep 2025 09:26:26 GMT
- Title: LayerLock: Non-collapsing Representation Learning with Progressive Freezing
- Authors: Goker Erdogan, Nikhil Parthasarathy, Catalin Ionescu, Drew A. Hudson, Alexander Lerchner, Andrew Zisserman, Mehdi S. M. Sajjadi, Joao Carreira
- Abstract summary: We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning. We make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth. We show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule.
- Score: 74.78054305471325
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from "representation collapse". We apply our proposed approach, LayerLock, to large models of up to 4B parameters with results surpassing those of non-latent masked prediction on the 4DS perception suite.
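As a rough illustration of the idea (not the authors' code; the schedule shape, function names, and the linear ramp are all assumptions), progressive freezing according to an explicit depth-ordered schedule might look like the following sketch, where a toy "model" is just a list of layers with trainable flags:

```python
def layers_to_freeze(step: int, total_steps: int, n_layers: int,
                     start_frac: float = 0.25) -> int:
    """Number of shallowest layers frozen at `step`.

    Freezing begins after `start_frac` of training and then advances
    linearly with depth, mirroring the observation that shallower ViT
    layers converge earlier. All names and constants are illustrative.
    """
    start = int(start_frac * total_steps)
    if step < start:
        return 0
    frac = (step - start) / max(1, total_steps - start)
    # Never freeze the final layer, so the loss always has trainable params.
    return min(n_layers - 1, int(frac * n_layers))


# Toy stand-in for a 12-layer ViT: each layer is a dict with a flag.
layers = [{"trainable": True} for _ in range(12)]


def apply_schedule(step: int, total_steps: int) -> None:
    """Freeze the k shallowest layers prescribed by the schedule."""
    k = layers_to_freeze(step, total_steps, len(layers))
    for i, layer in enumerate(layers):
        layer["trainable"] = i >= k
```

In a real training loop the flag would instead toggle something like each layer's parameter gradients (e.g. `requires_grad` in PyTorch), and frozen layers could be excluded from the optimizer to realize the compute savings.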
Related papers
- From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction [22.291273919939957]
We develop a scalable synthetic data pipeline that generates human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. We train a unified ViT-based dense predictor that injects an explicit geometric human prior via CSE embeddings. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences.
arXiv Detail & Related papers (2026-02-02T05:28:58Z)
- Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach learns strong representations.
arXiv Detail & Related papers (2025-12-24T07:07:08Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding [13.747101397628887]
We present an end-to-end solution to speed up inference of large language models (LLMs).
We apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit.
We show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model.
arXiv Detail & Related papers (2024-04-25T16:20:23Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- Pair-wise Layer Attention with Spatial Masking for Video Prediction [46.17429511620538]
We develop a Pair-wise Layer Attention (PLA) module to enhance the layer-wise semantic dependency of the feature maps.
We also present a Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for Translator prediction.
arXiv Detail & Related papers (2023-11-19T10:29:05Z)
- ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds [21.6511040107249]
We propose a novel self-supervised motion estimator for LiDAR-based autonomous driving via BEV representation.
We predict scene motion via feature-level consistency between pillars in consecutive frames, which can eliminate the effect caused by noise points and view-changing point clouds in dynamic scenes.
arXiv Detail & Related papers (2023-04-25T05:46:24Z)
- Jump to Conclusions: Short-Cutting Transformers With Linear Transformations [60.37563766047492]
Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction.
This obscures the internal decision-making process of the model and the utility of its intermediate representations.
We suggest a simple method for casting intermediate representations into final-layer space, using linear transformations.
arXiv Detail & Related papers (2023-03-16T16:10:16Z)
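The linear-casting idea above can be sketched in a few lines: fit a linear map from intermediate-layer hidden states to final-layer states and use it to "short-cut" the remaining layers. This is an illustrative toy with synthetic data, not the paper's implementation; the array shapes and the exact-linear setup are assumptions made so the fit is checkable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden states: n tokens, d-dim representations.
n, d = 256, 16
h_mid = rng.normal(size=(n, d))      # intermediate-layer states
W_true = rng.normal(size=(d, d))
h_final = h_mid @ W_true             # pretend final-layer states

# Fit a linear map A so that h_mid @ A approximates h_final
# (ordinary least squares over a held set of token representations).
A, *_ = np.linalg.lstsq(h_mid, h_final, rcond=None)

# "Short-cut" prediction of final-layer states from the middle layer.
h_shortcut = h_mid @ A
rel_err = np.linalg.norm(h_shortcut - h_final) / np.linalg.norm(h_final)
```

With real transformer activations the relation is only approximately linear, so `rel_err` would be nonzero but, per the paper's claim, small enough for the cast states to be useful for prediction.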
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.