Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation
- URL: http://arxiv.org/abs/2504.12753v3
- Date: Tue, 15 Jul 2025 14:57:45 GMT
- Title: Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation
- Authors: Siyu Chen, Ting Han, Changshe Zhang, Xin Luo, Meiliu Wu, Guorong Cai, Jinhe Su,
- Abstract summary: Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2.
- Score: 11.220592454534746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse-xzrptkvyqc/DepthForge.
Related papers
- UDPNet: Unleashing Depth-based Priors for Robust Image Dehazing [77.10640210751981]
UDPNet is a general framework that leverages depth-based priors from a large-scale pretrained depth estimation model, DepthAnything V2. Our proposed solution establishes a new benchmark for depth-aware dehazing across various scenarios.
arXiv Detail & Related papers (2026-01-11T13:29:02Z)
- WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation [4.654162664140336]
Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities. We propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures and pretrained weights. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels.
arXiv Detail & Related papers (2025-11-11T09:41:27Z)
- GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet). It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z)
- Depth Jitter: Seeing through the Depth [2.2842607238440857]
We introduce Depth-Jitter, a novel depth-based augmentation technique that simulates natural depth variations to improve generalization. Our approach applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations. We evaluate Depth-Jitter on two benchmark datasets, FathomNet and UTDAC 2020.
arXiv Detail & Related papers (2025-08-08T11:14:57Z)
- Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation [8.068623902839368]
Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity. OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains. We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time.
arXiv Detail & Related papers (2025-06-11T15:54:47Z)
- DepthFusion: Depth-Aware Hybrid Feature Fusion for LiDAR-Camera 3D Object Detection [32.07206206508925]
State-of-the-art LiDAR-camera 3D object detectors usually focus on feature fusion. We are the first to observe that different modalities play different roles as depth varies via statistical analysis and visualization. We propose a Depth-Aware Hybrid Feature Fusion strategy that guides the weights of point cloud and RGB image modalities.
arXiv Detail & Related papers (2025-05-12T09:53:00Z)
- Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation [24.531539125814877]
Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines the resolution of VFM features. Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves the quality of VFM features.
arXiv Detail & Related papers (2025-05-04T11:59:26Z)
- DepthMaster: Taming Diffusion Models for Monocular Depth Estimation [41.81343543266191]
We propose a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. We adopt a two-stage training strategy to fully leverage the potential of the two modules. Our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets.
arXiv Detail & Related papers (2025-01-05T15:18:32Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- DepthLab: From Partial to Complete [80.58276388743306]
Missing values remain a common challenge for depth data across its wide range of applications. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our approach proves its worth in various downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion.
arXiv Detail & Related papers (2024-12-24T04:16:38Z)
- Mask-adaptive Gated Convolution and Bi-directional Progressive Fusion Network for Depth Completion [3.5940515868907164]
We propose a new model for depth completion based on an encoder-decoder structure. Our model introduces two key components: the Mask-adaptive Gated Convolution architecture and the Bi-directional Progressive Fusion module. We achieve remarkable performance in completing depth maps and outperform existing approaches in terms of accuracy and reliability.
arXiv Detail & Related papers (2024-01-15T02:58:06Z)
- EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment [40.328294121805456]
This work builds on the previous work VPD, which paved the way to use the Stable Diffusion network for computer vision tasks.
First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module which enhances feature learning capabilities.
Second, we propose a novel image-text alignment module for improved feature extraction of the Stable Diffusion backbone.
arXiv Detail & Related papers (2023-12-13T22:20:45Z)
- Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network [80.19054069988559]
We find that self-supervised monocular depth estimation shows a direction sensitivity and environmental dependency.
We propose a new Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth representation in two aspects.
Experiments show that our method achieves significant improvements on three widely used benchmarks.
arXiv Detail & Related papers (2023-08-10T14:32:18Z)
- Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing heuristically crafted masks.
Experiments on real-world datasets prove the significant effectiveness and generalization ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z)
- Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework to exploit monocular depth estimation for improving visual odometry (VO).
The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes.
Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z)
- Progressive Multi-scale Fusion Network for RGB-D Salient Object Detection [9.099589602551575]
We discuss the advantages of the so-called progressive multi-scale fusion method and propose a mask-guided feature aggregation module.
The proposed framework can effectively combine the two features of different modalities and alleviate the impact of erroneous depth features.
We further introduce a mask-guided refinement module (MGRM) to complement the high-level semantic features and reduce the irrelevant features from multi-scale fusion.
arXiv Detail & Related papers (2021-06-07T20:02:39Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt the graph propagation to capture the observed spatial contexts.
We then apply the attention mechanism on the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.