DeFM: Learning Foundation Representations from Depth for Robotics
- URL: http://arxiv.org/abs/2601.18923v1
- Date: Mon, 26 Jan 2026 19:45:31 GMT
- Title: DeFM: Learning Foundation Representations from Depth for Robotics
- Authors: Manthan Patel, Jonas Frey, Mayank Mittal, Fan Yang, Alexander Hansson, Amir Bar, Cesar Cadena, Marco Hutter,
- Abstract summary: We present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors. It achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments.
- Score: 49.77188649197404
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Depth sensors are widely deployed across robotic platforms, and advances in fast, high-fidelity depth simulation have enabled robotic policies trained on depth observations to achieve robust sim-to-real transfer for a wide range of tasks. Despite this, representation learning for depth modality remains underexplored compared to RGB, where large-scale foundation models now define the state of the art. To address this gap, we present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. Using a DINO-style self-distillation objective on a curated dataset of 60M depth images, DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors. To retain metric awareness across multiple scales, we introduce a novel input normalization strategy. We further distill DeFM into compact models suitable for resource-constrained robotic systems. When evaluated on depth-based classification, segmentation, navigation, locomotion, and manipulation benchmarks, DeFM achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments. We release all our pretrained models, which can be adopted off-the-shelf for depth-based robotic learning without task-specific fine-tuning. Webpage: https://de-fm.github.io/
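The abstract mentions a novel input normalization strategy that retains metric awareness across multiple scales, but does not specify it. The following is a hypothetical sketch of one plausible approach, assuming a per-image scale factor is computed and returned alongside the normalized depth; the function name, the median-based scaling, and the `max_range` parameter are all illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def normalize_depth(depth: np.ndarray, max_range: float = 20.0):
    """Normalize a metric depth map to [0, 1] while keeping its scale.

    depth: HxW array of metric depths (meters); invalid pixels are 0 or NaN.
    Returns (normalized HxW array, per-image scale factor in meters).
    """
    valid = np.isfinite(depth) & (depth > 0)
    if not valid.any():
        return np.zeros_like(depth), 1.0
    # Clip to the sensor range, then divide by the per-image median depth so
    # that near and far scenes map onto a comparable value range while the
    # returned scale preserves the metric information.
    clipped = np.clip(np.where(valid, depth, 0.0), 0.0, max_range)
    scale = float(np.median(clipped[valid]))
    normalized = np.where(valid, clipped / (2.0 * scale), 0.0)
    return np.clip(normalized, 0.0, 1.0), scale
```

A network consuming such input could receive the scale as an auxiliary token or channel; that design choice is likewise an assumption here, shown only to make the "metric awareness" idea concrete.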
Related papers
- Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents a novel R-prior-S, a Recurrent Geometric-prior modal Policy with Spiking features. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases. For the data efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z) - BRIDGE -- Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation [17.554501937884172]
BRIDGE is an RL-optimized depth-to-image (D2I) generation framework. It synthesizes over 20M realistic and geometrically accurate RGB images, each intrinsically paired with its ground truth depth. We train our depth estimation model on this dataset, employing a hybrid supervision strategy.
arXiv Detail & Related papers (2025-09-29T17:19:45Z) - Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots [55.43376513158555]
Camera Depth Models (CDMs) are a simple plugin for daily-use depth cameras. We develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots.
arXiv Detail & Related papers (2025-09-02T17:29:38Z) - Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation [0.6533458718563319]
This paper presents a framework designed to enhance the grasping capabilities of quadrupeds equipped with arms. We developed a pipeline within the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multi-modal input from onboard RGB and depth cameras. We validated the complete framework on a four-legged robot.
arXiv Detail & Related papers (2025-08-24T17:47:56Z) - A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck. Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z) - SAID-NeRF: Segmentation-AIDed NeRF for Depth Completion of Transparent Objects [7.529049797077149]
Acquiring accurate depth information of transparent objects using off-the-shelf RGB-D cameras is a well-known challenge in Computer Vision and Robotics.
NeRFs are learning-free approaches and have demonstrated wide success in novel view synthesis and shape recovery.
Our proposed method, SAID-NeRF, shows significant performance on depth completion datasets for transparent objects and robotic grasping.
arXiv Detail & Related papers (2024-03-28T17:28:32Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Physics-based Differentiable Depth Sensor Simulation [5.134435281973137]
We introduce a novel end-to-end differentiable simulation pipeline for the generation of realistic 2.5D scans.
Each module can be differentiated w.r.t sensor and scene parameters.
Our simulation greatly improves the performance of the resulting models on real scans.
arXiv Detail & Related papers (2021-03-30T17:59:43Z) - On Deep Learning Techniques to Boost Monocular Depth Estimation for Autonomous Navigation [1.9007546108571112]
Inferring the depth of images is a fundamental inverse problem within the field of Computer Vision.
We propose a new lightweight and fast supervised CNN architecture combined with novel feature extraction models.
We also introduce an efficient surface normals module, jointly with a simple geometric 2.5D loss function, to solve SIDE problems.
arXiv Detail & Related papers (2020-10-13T18:37:38Z) - Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.