In Pursuit of Pixel Supervision for Visual Pre-training
- URL: http://arxiv.org/abs/2512.15715v1
- Date: Wed, 17 Dec 2025 18:59:58 GMT
- Title: In Pursuit of Pixel Supervision for Visual Pre-training
- Authors: Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu
- Abstract summary: "Pixio" is an enhanced masked autoencoder (MAE) trained on 2B web-crawled images with a self-curation strategy with minimal human curation.<n>Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation, feed-forward 3D reconstruction, semantic segmentation, and robot learning.<n>Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
- Score: 60.63095313440605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images with a self-curation strategy requiring minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
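For readers less familiar with the paradigm, the sketch below shows the core masked-autoencoder loop the abstract refers to: mask most patches, encode only the visible ones, and regress the raw pixels of the masked patches. The module sizes, the 75% masking ratio, and the omission of positional embeddings are illustrative simplifications, not Pixio's actual architecture or training recipe.

```python
# Minimal masked-autoencoder (MAE) sketch in PyTorch. Module sizes, masking
# ratio, and the omission of positional embeddings are illustrative
# assumptions; this is not the paper's Pixio architecture or recipe.
import torch
import torch.nn as nn


def patchify(imgs, p=16):
    """Split (B, C, H, W) images into (B, N, p*p*C) flat patches."""
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)


class TinyMAE(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, patch_dim)  # regress raw pixels per patch

    def forward(self, imgs):
        patches = patchify(imgs)                                # (B, N, D_patch)
        B, N, D = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=imgs.device).argsort(dim=1)
        keep, masked = idx[:, :n_keep], idx[:, n_keep:]
        visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(self.embed(visible))              # encode visible only
        # append mask tokens and decode everything back to pixel space
        tokens = torch.cat([latent, self.mask_token.expand(B, N - n_keep, -1)], 1)
        pred = self.head(self.decoder(tokens))[:, n_keep:]      # masked slots only
        target = torch.gather(patches, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        return ((pred - target) ** 2).mean()                    # pixel MSE loss
```

Per the abstract, Pixio's contribution is scaling and hardening this recipe (harder pre-training tasks, more capable architectures, 2B self-curated images) rather than changing its pixel-reconstruction objective.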
Related papers
- DeFM: Learning Foundation Representations from Depth for Robotics [49.77188649197404]
We present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors. It achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments.
arXiv Detail & Related papers (2026-01-26T19:45:31Z) - Perception Encoder: The best visual embeddings are not at the output of the network [70.86738083862099]
We introduce Perception Encoder (PE), a vision encoder for image and video understanding trained via simple vision-language learning. We find that contrastive vision-language training alone can produce strong, general embeddings for a wide range of downstream tasks. Together, our PE family of models achieves best-in-class results on a wide variety of tasks.
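For context, "contrastive vision-language training" of this kind typically means the symmetric image-text InfoNCE (CLIP-style) objective sketched below; the embedding shapes and temperature are placeholder assumptions, not PE's actual configuration.

```python
# Symmetric image-text contrastive loss (CLIP-style InfoNCE). Temperature and
# embedding shapes are placeholder assumptions, not PE's configuration.
import torch
import torch.nn.functional as F


def contrastive_vl_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) embeddings of B matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    # matched pairs sit on the diagonal; contrast in both directions
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```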
arXiv Detail & Related papers (2025-04-17T17:59:57Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - Heuristic Vision Pre-Training with Self-Supervised and Supervised
Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models deliver performance on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
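As a generic illustration of combining self-supervised and supervised pretext tasks over one shared backbone, the sketch below pairs a classic rotation-prediction pretext head with a supervised classification head; the task choice, heads, and loss weighting are assumptions, not this paper's exact setup.

```python
# Generic multi-task pre-training sketch: one shared backbone trained jointly
# with a self-supervised pretext head (rotation prediction, as one classic
# example) and a supervised classification head. Task choice and weighting
# are illustrative assumptions, not this paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskPretrainer(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone                   # shared encoder, outputs (B, feat_dim)
        self.rot_head = nn.Linear(feat_dim, 4)     # predict 0/90/180/270 degrees
        self.cls_head = nn.Linear(feat_dim, num_classes)

    def forward(self, imgs, labels, sup_weight=1.0):
        # self-supervised branch: predict which rotation was applied
        k = torch.randint(0, 4, (imgs.size(0),), device=imgs.device)
        rotated = torch.stack([torch.rot90(im, int(r), dims=(1, 2))
                               for im, r in zip(imgs, k)])
        ssl_loss = F.cross_entropy(self.rot_head(self.backbone(rotated)), k)
        # supervised branch: ordinary label prediction on the clean images
        sup_loss = F.cross_entropy(self.cls_head(self.backbone(imgs)), labels)
        return ssl_loss + sup_weight * sup_loss
```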
arXiv Detail & Related papers (2023-10-11T14:06:04Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can help mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results in low-shot settings and strong experimental results under various evaluation protocols.
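As a rough sketch of the general "predict masked online codebook assignments" idea, a teacher's soft assignments over a codebook can supervise a student that only sees the masked view; MOCA's momentum update, codebook maintenance, and projection heads are simplified away here, so this is an illustration of the family of objectives rather than the paper's method.

```python
# Generic "predict masked token assignments to an online codebook" objective,
# in the spirit of MOCA. The teacher/student split, codebook handling, and
# masking are simplified assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F


def masked_assignment_loss(student_tokens, teacher_tokens, codebook,
                           mask, temperature=0.1):
    """
    student_tokens: (B, N, D) from the masked view (gradients flow here).
    teacher_tokens: (B, N, D) from the full view (momentum encoder, no grad).
    codebook:       (K, D) online prototypes.
    mask:           (B, N) bool, True where the student's input was masked.
    """
    proto = F.normalize(codebook, dim=-1)
    with torch.no_grad():
        targets = F.softmax(
            F.normalize(teacher_tokens, dim=-1) @ proto.t() / temperature,
            dim=-1)                                              # (B, N, K)
    logits = F.normalize(student_tokens, dim=-1) @ proto.t() / temperature
    loss = -(targets * F.log_softmax(logits, dim=-1)).sum(-1)    # (B, N)
    return loss[mask].mean()  # supervise only the masked positions
```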
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - Learning Invariant World State Representations with Predictive Coding [1.8963850600275547]
We develop a new predictive coding-based architecture and a hybrid fully-supervised/self-supervised learning method.
We evaluate the robustness of our model on a new synthetic dataset.
arXiv Detail & Related papers (2022-07-06T21:08:30Z) - Self-supervised Learning for Sonar Image Classification [6.1947705963945845]
Self-supervised learning has proved to be a powerful approach to learn image representations without the need for large labeled datasets.
We present pre-training and transfer learning results on real-life sonar image datasets.
arXiv Detail & Related papers (2022-04-20T08:58:35Z) - The Devil is in the Frequency: Geminated Gestalt Autoencoder for
Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge^2-AE), for visual pre-training.
Specifically, we equip our model with geminated decoders responsible for reconstructing image contents in both pixel and frequency space.
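One simple way to realize reconstruction in both spaces is to add a Fourier-domain term to the usual pixel loss, as sketched below; the loss weighting and spectral parameterization are assumptions, and the two geminated decoders themselves are elided.

```python
# Dual-space reconstruction loss: penalize error in pixel space and in the
# 2D Fourier domain. How Ge^2-AE weights and parameterizes the frequency
# branch is simplified here; its separate decoders are also elided.
import torch
import torch.nn.functional as F


def pixel_and_frequency_loss(pred, target, freq_weight=1.0):
    """pred, target: (B, C, H, W) reconstructed and original images."""
    pixel_loss = F.mse_loss(pred, target)
    # compare complex spectra via their real/imaginary parts
    pf, tf = torch.fft.fft2(pred), torch.fft.fft2(target)
    freq_loss = F.mse_loss(torch.view_as_real(pf), torch.view_as_real(tf))
    return pixel_loss + freq_weight * freq_loss
```

The frequency term gives every output pixel a global receptive field in the loss, which is the intuition behind supervising reconstruction in the spectral domain.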
arXiv Detail & Related papers (2022-04-18T09:22:55Z) - Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised
Visual Representation Learning [60.75687261314962]
We introduce pixel-level pretext tasks for learning dense feature representations.
A pixel-to-propagation consistency task produces better results than state-of-the-art approaches.
Results demonstrate the strong potential of defining pretext tasks at the pixel level.
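A minimal form of such a pixel-level consistency objective is sketched below: features from one augmented view are smoothed by a propagation step and matched against the other view's features. Computing pixel correspondences from the augmentation geometry is assumed away (the feature maps are treated as already aligned), so this is a simplification of the actual pixel-to-propagation task.

```python
# Pixel-level consistency in the spirit of pixel-to-propagation: one view's
# features are smoothed by a propagation step and matched against the other
# view's features. Correspondence computation from augmentation geometry is
# assumed away here (maps treated as spatially aligned).
import torch
import torch.nn.functional as F


def propagate(feat):
    """Similarity-weighted smoothing over the H*W pixel features."""
    B, C, H, W = feat.shape
    x = feat.flatten(2)                                  # (B, C, HW)
    xn = F.normalize(x, dim=1)
    sim = torch.einsum('bci,bcj->bij', xn, xn)           # pairwise cosine sims
    return torch.einsum('bij,bcj->bci', sim.relu(), x).view(B, C, H, W)


def pixel_consistency_loss(feat_a, feat_b):
    """feat_a, feat_b: (B, C, H, W) feature maps from two augmented views."""
    prop_a = F.normalize(propagate(feat_a), dim=1)
    tgt_b = F.normalize(feat_b.detach(), dim=1)          # stop-grad on target
    return -(prop_a * tgt_b).sum(1).mean()               # negative cosine sim
```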
arXiv Detail & Related papers (2020-11-19T18:59:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.