DepthFocus: Controllable Depth Estimation for See-Through Scenes
- URL: http://arxiv.org/abs/2511.16993v1
- Date: Fri, 21 Nov 2025 06:59:54 GMT
- Title: DepthFocus: Controllable Depth Estimation for See-Through Scenes
- Authors: Junhong Min, Jimin Kim, Cheol-Hui Min, Minwook Kim, Youngpil Jeon, Minyong Choi,
- Abstract summary: We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes.
- Score: 2.934725935750573
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive, attempting to estimate static depth maps anchored to the nearest surface, while humans actively shift focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes. The training primarily leverages our newly constructed 500k multi-layered synthetic dataset, designed to capture diverse see-through effects. DepthFocus not only achieves state-of-the-art performance on conventional single-depth benchmarks like BOOSTER, a dataset notably rich in transparent and reflective objects, but also quantitatively demonstrates intent-aligned estimation on our newly proposed real and synthetic multi-depth datasets. Moreover, it exhibits strong generalization capabilities on unseen see-through scenes, underscoring its robustness as a significant step toward active and human-like 3D perception.
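The abstract's core mechanism, conditioning the network on a scalar depth preference, can be illustrated with a minimal sketch. The paper does not publish code, so everything below (the sinusoidal embedding, the additive token conditioning, the function names) is a hypothetical illustration of how a single scalar could steer transformer features, not the actual DepthFocus architecture.

```python
import numpy as np

def depth_preference_embedding(z: float, dim: int = 64) -> np.ndarray:
    """Sinusoidal embedding of a scalar depth preference z (assumed scheme)."""
    freqs = np.exp(np.linspace(0.0, np.log(1000.0), dim // 2))
    angles = z * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def condition_tokens(tokens: np.ndarray, z: float) -> np.ndarray:
    """Add the depth-preference embedding to every token in a (N, dim) grid,
    so downstream attention layers can specialize on the requested depth."""
    emb = depth_preference_embedding(z, tokens.shape[-1])
    return tokens + emb  # broadcasts the embedding over all N tokens

tokens = np.zeros((16, 64))           # 16 dummy tokens of width 64
near = condition_tokens(tokens, 0.5)  # steer toward a near layer
far = condition_tokens(tokens, 5.0)   # steer toward a farther layer
```

Changing only the scalar changes every token's conditioning signal, which is the minimum needed for the rest of the network to produce a different, intent-aligned depth map.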
Related papers
- Depth Jitter: Seeing through the Depth [2.2842607238440857]
We introduce Depth-Jitter, a novel depth-based augmentation technique that simulates natural depth variations to improve generalization. Our approach applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations. We evaluate Depth-Jitter on two benchmark datasets, FathomNet and UTDAC 2020.
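The adaptive offsetting described above is straightforward to sketch. The snippet below is a guess at the mechanism (the 3x3 variance window, Gaussian offsets, and threshold value are all assumptions, not the paper's parameters): jitter is applied only where local depth variance is low, leaving depth discontinuities intact.

```python
import numpy as np

def depth_jitter(depth: np.ndarray, var_thresh: float = 0.05,
                 scale: float = 0.1, rng=None) -> np.ndarray:
    """Perturb a depth map only where local variance stays below a threshold,
    so the jitter mimics natural depth variation without destroying edges."""
    rng = np.random.default_rng(rng)
    # Local variance over an assumed 3x3 neighborhood (edge-padded).
    pad = np.pad(depth, 1, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(pad, (3, 3))
    local_var = windows.var(axis=(-1, -2))
    mask = local_var < var_thresh            # True on locally flat regions
    offset = rng.normal(0.0, scale, size=depth.shape)
    return np.where(mask, depth + offset, depth)

# Toy map: flat region on the left, a 10 m depth discontinuity on the right.
depth = np.zeros((6, 6))
depth[:, 3:] = 10.0
jittered = depth_jitter(depth, rng=0)
```

Flat areas receive small synthetic perturbations while the high-variance columns around the discontinuity pass through unchanged, which is the behavior the variance threshold is meant to enforce.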
arXiv Detail & Related papers (2025-08-08T11:14:57Z) - Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask [10.472380465235629]
ReMake is a novel depth completion framework guided by an instance mask and monocular depth estimation. Our method outperforms existing approaches on both benchmark datasets and real-world scenarios.
arXiv Detail & Related papers (2025-08-04T15:14:47Z) - Seurat: From Moving Points to Depth [66.65189052568209]
We propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.
arXiv Detail & Related papers (2025-04-20T17:37:02Z) - Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image [51.689871870692194]
Metric-Solver is a novel sliding anchor-based metric depth estimation method. Our design enables a unified and adaptive depth representation across diverse environments.
arXiv Detail & Related papers (2025-04-16T14:12:25Z) - ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation [62.600382533322325]
We propose a novel monocular depth estimation method called ScaleDepth.
Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction module.
Our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework.
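The decomposition in this abstract (metric depth as scene scale times relative depth) reduces to a one-line recombination once both quantities are predicted. The sketch below shows only that recombination step with made-up scales; the semantic-aware scale prediction module itself is a learned network and is not reproduced here.

```python
import numpy as np

def compose_metric_depth(relative: np.ndarray, scene_scale: float) -> np.ndarray:
    """Recombine a normalized relative-depth map (values in [0, 1]) with a
    predicted per-scene scale to recover metric depth."""
    return scene_scale * relative

relative = np.array([[0.1, 0.5],
                     [0.9, 1.0]])                           # toy relative depths
indoor = compose_metric_depth(relative, scene_scale=8.0)    # ~8 m indoor scene
outdoor = compose_metric_depth(relative, scene_scale=80.0)  # ~80 m outdoor scene
```

The same relative map serves both regimes, which is why predicting scale separately lets one framework cover indoor and outdoor scenes.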
arXiv Detail & Related papers (2024-07-11T05:11:56Z) - Transparent Object Depth Completion [11.825680661429825]
The perception of transparent objects for grasp and manipulation remains a major challenge.
Existing robotic grasp methods which heavily rely on depth maps are not suitable for transparent objects due to their unique visual properties.
We propose an end-to-end network for transparent object depth completion that combines the strengths of single-view RGB-D based depth completion and multi-view depth estimation.
arXiv Detail & Related papers (2024-05-24T07:38:06Z) - Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation [23.93080319283679]
Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss.
Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation.
This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data.
arXiv Detail & Related papers (2024-04-23T10:51:15Z) - Depth-aware Volume Attention for Texture-less Stereo Matching [67.46404479356896]
We propose a lightweight volume refinement scheme to tackle the texture deterioration in practical outdoor scenarios.
We introduce a depth volume supervised by the ground-truth depth map, capturing the relative hierarchy of image texture.
Local fine structure and context are emphasized to mitigate ambiguity and redundancy during volume aggregation.
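The "depth volume supervised by the ground-truth depth map" can be approximated by binning ground-truth depth into ordinal slices, one per volume plane. The discretization below (8 uniform bins over an assumed 0.5–50 m range) is an illustration; the paper's actual volume construction may differ.

```python
import numpy as np

def depth_to_volume(depth: np.ndarray, n_bins: int = 8,
                    d_min: float = 0.5, d_max: float = 50.0) -> np.ndarray:
    """Build a one-hot (n_bins, H, W) volume from a ground-truth depth map.
    Each pixel activates the plane whose depth bin contains it, encoding the
    coarse relative hierarchy that can supervise a learned depth volume."""
    edges = np.linspace(d_min, d_max, n_bins + 1)
    idx = np.clip(np.digitize(depth, edges) - 1, 0, n_bins - 1)
    onehot = np.eye(n_bins)[idx]        # (H, W, n_bins)
    return np.moveaxis(onehot, -1, 0)   # (n_bins, H, W)

# A near pixel lands in the first plane, a far pixel in the last.
volume = depth_to_volume(np.array([[0.5, 49.9]]))
```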
arXiv Detail & Related papers (2024-02-14T04:07:44Z) - SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption for training networks, however, that is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z) - DnD: Dense Depth Estimation in Crowded Dynamic Indoor Scenes [68.38952377590499]
We present a novel approach for estimating depth from a monocular camera as it moves through complex indoor environments.
Our approach predicts absolute scale depth maps over the entire scene consisting of a static background and multiple moving people.
arXiv Detail & Related papers (2021-08-12T09:12:39Z) - Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues [24.743099160992937]
We propose a novel self-supervised joint learning framework for depth estimation.
The proposed framework outperforms the state-of-the-art (SOTA) on the KITTI and Make3D datasets.
arXiv Detail & Related papers (2020-06-17T13:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.