Transferring to Real-World Layouts: A Depth-aware Framework for Scene Adaptation
- URL: http://arxiv.org/abs/2311.12682v2
- Date: Wed, 31 Jul 2024 10:37:00 GMT
- Title: Transferring to Real-World Layouts: A Depth-aware Framework for Scene Adaptation
- Authors: Mu Chen, Zhedong Zheng, Yi Yang
- Abstract summary: Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from source synthetic data to real-world target data.
We propose a depth-aware framework to explicitly leverage depth estimation to mix the categories and facilitate the two complementary tasks, i.e., segmentation and depth learning.
In particular, the framework contains a Depth-guided Contextual Filter (DCF) for data augmentation and a cross-task encoder for contextual learning.
- Score: 34.786268652516355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from source synthetic data to real-world target data, which largely reduces the need for manual pixel-level annotations in the target domain. To facilitate domain-invariant feature learning, existing methods typically mix data from both the source domain and target domain by simply copying and pasting the pixels. Such vanilla methods are usually sub-optimal since they do not take into account how well the mixed layouts correspond to real-world scenarios, which have an inherent layout. We observe that semantic categories, such as sidewalks, buildings, and sky, display relatively consistent depth distributions and can be clearly distinguished in a depth map. Based on this observation, we propose a depth-aware framework that explicitly leverages depth estimation to mix the categories and facilitates the two complementary tasks, i.e., segmentation and depth learning, in an end-to-end manner. In particular, the framework contains a Depth-guided Contextual Filter (DCF) for data augmentation and a cross-task encoder for contextual learning. DCF simulates real-world layouts, while the cross-task encoder adaptively fuses the complementary features of the two tasks. Moreover, since several public datasets do not provide depth annotation, we leverage an off-the-shelf depth estimation network to generate pseudo depth. Extensive experiments show that our proposed method, even with pseudo depth, achieves competitive performance on two widely used benchmarks: 77.7 mIoU on GTA to Cityscapes and 69.3 mIoU on Synthia to Cityscapes.
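The depth-guided mixing behind DCF can be pictured in a few lines. The following is a minimal sketch, not the authors' released code: it assumes per-pixel (pseudo) depth for both domains, and pastes source-class pixels onto the target image only where the two depth maps roughly agree, so the mixed layout stays plausible (sky stays far, sidewalk stays near). The function name `depth_guided_mix` and the `depth_tol` tolerance are hypothetical.

```python
# Minimal sketch of depth-guided class mixing (illustrative; not the paper's code).
import torch

def depth_guided_mix(src_img, src_lbl, src_depth, tgt_img, tgt_depth,
                     classes, depth_tol=0.1, ignore_index=255):
    """src_img/tgt_img: (3, H, W); src_lbl: (H, W) int labels;
    src_depth/tgt_depth: (H, W) normalized (pseudo) depth maps."""
    mixed_img = tgt_img.clone()
    mixed_lbl = torch.full_like(src_lbl, ignore_index)  # unlabeled target pixels
    for c in classes:
        # Paste class-c source pixels only where source and target depths
        # roughly agree, so the pasted object lands at a plausible distance.
        paste = (src_lbl == c) & ((src_depth - tgt_depth).abs() < depth_tol)
        mixed_img[:, paste] = src_img[:, paste]
        mixed_lbl[paste] = c
    return mixed_img, mixed_lbl
```

In a training loop, the mixed image and label would feed the segmentation loss in place of a plain copy-paste mix, which is the sub-optimal baseline the abstract criticizes.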
Related papers
- MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation [155.0797148367653]
Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain and an unlabeled target domain.
We propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries.
We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks.
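As a rough reading of this summary (not MICDrop's actual implementation), complementary dropout across the two modalities might look like the sketch below: each spatial location keeps exactly one of the image or depth features, forcing the network to fill in each modality from the other.

```python
# Hedged sketch of complementary image/depth feature dropout
# (my reading of the summary, not MICDrop's implementation).
import torch

def complementary_dropout(img_feat, depth_feat, p_drop_img=0.5):
    """img_feat, depth_feat: (B, C, H, W). At every spatial location exactly
    one modality survives: where the image feature is zeroed out, the depth
    feature is kept, and vice versa."""
    b, _, h, w = img_feat.shape
    drop_img = (torch.rand(b, 1, h, w, device=img_feat.device) < p_drop_img).float()
    return img_feat * (1.0 - drop_img), depth_feat * drop_img
```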
arXiv Detail & Related papers (2024-08-29T12:15:10Z)
- Progressive Depth Decoupling and Modulating for Flexible Depth Completion [28.693100885012008]
Image-guided depth completion aims at generating a dense depth map from sparse LiDAR data and RGB image.
Recent methods have shown promising performance by reformulating it as a classification problem with two sub-tasks: depth discretization and probability prediction.
We propose a progressive depth decoupling and modulating network, which incrementally decouples the depth range into bins and adaptively generates multi-scale dense depth maps.
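The classification reformulation mentioned above can be summarized with a short sketch, under my own assumptions (uniform bins and soft-argmax decoding; the paper's exact discretization may differ): per-pixel probabilities over K depth bins are decoded to continuous depth as a probability-weighted average of bin centers.

```python
# Sketch of depth-as-classification: logits over K depth bins are decoded to
# continuous depth via a probability-weighted average of bin centers
# (assumes uniform bins; the paper's scheme may differ).
import torch

def expected_depth(logits, d_min=1.0, d_max=80.0):
    """logits: (B, K, H, W) per-pixel scores over K depth bins."""
    k = logits.shape[1]
    centers = torch.linspace(d_min, d_max, k, device=logits.device)  # bin centers
    probs = logits.softmax(dim=1)                                    # (B, K, H, W)
    return (probs * centers.view(1, k, 1, 1)).sum(dim=1)            # (B, H, W)
```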
arXiv Detail & Related papers (2024-05-15T13:45:33Z)
- Depth-aware Volume Attention for Texture-less Stereo Matching [67.46404479356896]
We propose a lightweight volume refinement scheme to tackle the texture deterioration in practical outdoor scenarios.
We introduce a depth volume supervised by the ground-truth depth map, capturing the relative hierarchy of image texture.
Local fine structure and context are emphasized to mitigate ambiguity and redundancy during volume aggregation.
arXiv Detail & Related papers (2024-02-14T04:07:44Z)
- BridgeNet: A Joint Learning Network of Depth Map Super-Resolution and Monocular Depth Estimation [60.34562823470874]
We propose a joint learning network of depth map super-resolution (DSR) and monocular depth estimation (MDE) without introducing additional supervision labels.
The two tasks interact through two bridges. One is the high-frequency attention bridge (HABdg) designed for the feature encoding process, which learns the high-frequency information of the MDE task to guide the DSR task.
The other is the content guidance bridge (CGBdg) designed for the depth map reconstruction process, which provides the content guidance learned from DSR task for MDE task.
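One way to picture such a bridge, as a loose sketch of the idea rather than BridgeNet's published modules: an attention map computed from one task's features multiplicatively modulates the other's.

```python
# Loose sketch of a cross-task guidance bridge (illustrative, not BridgeNet's
# published modules): MDE features gate DSR features via learned attention.
import torch.nn as nn

class GuidanceBridge(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),  # per-pixel, per-channel weights in [0, 1]
        )

    def forward(self, dsr_feat, mde_feat):
        # Modulate DSR features with attention derived from MDE features.
        return dsr_feat * self.gate(mde_feat)
```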
arXiv Detail & Related papers (2021-07-27T01:28:23Z)
- Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation [84.34227665232281]
Domain adaptation for semantic segmentation aims to improve the model performance in the presence of a distribution shift between source and target domain.
We leverage the guidance from self-supervised depth estimation, which is available on both domains, to bridge the domain gap.
We demonstrate the effectiveness of our proposed approach on the benchmark tasks SYNTHIA-to-Cityscapes and GTA-to-Cityscapes.
arXiv Detail & Related papers (2021-04-28T07:47:36Z)
- S2R-DepthNet: Learning a Generalizable Depth-specific Structural Representation [63.58891781246175]
Humans can infer the 3D geometry of a scene from a sketch instead of a realistic image, which indicates that spatial structure plays a fundamental role in understanding the depth of scenes.
We are the first to explore the learning of a depth-specific structural representation, which captures the essential features for depth estimation and ignores irrelevant style information.
Our S2R-DepthNet can be well generalized to unseen real-world data directly even though it is only trained on synthetic data.
arXiv Detail & Related papers (2021-04-02T03:55:41Z)
- Learning Depth via Leveraging Semantics: Self-supervised Monocular Depth Estimation with Both Implicit and Explicit Semantic Guidance [34.62415122883441]
We propose a Semantic-aware Spatial Feature Alignment scheme to align implicit semantic features with depth features for scene-aware depth estimation.
We also propose a semantic-guided ranking loss to explicitly constrain the estimated depth maps to be consistent with real scene contextual properties.
Our method produces high-quality depth maps that are consistently superior on both complex scenes and diverse semantic categories.
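A pairwise formulation of such a ranking loss might look like the hedged sketch below; the pair sampling from semantic cues (e.g., sky farther than road) and the exact hinge form are my assumptions, not the paper's definition.

```python
# Hedged sketch of a depth ranking loss on pixel pairs whose ordering is
# assumed known from semantic cues; the paper's formulation may differ.
import torch

def pairwise_ranking_loss(pred_depth, idx_far, idx_near, margin=0.05):
    """pred_depth: (N,) flattened predictions; idx_far/idx_near: (M,) index
    tensors picking pairs where pixel idx_far[i] should be deeper."""
    violation = pred_depth[idx_near] - pred_depth[idx_far] + margin
    return torch.clamp(violation, min=0.0).mean()  # hinge over violated pairs
```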
arXiv Detail & Related papers (2021-02-11T14:29:51Z)
- Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation [90.87105131054419]
We present a framework for semi-supervised semantic segmentation, which is enhanced by self-supervised monocular depth estimation from unlabeled image sequences.
We validate the proposed model on the Cityscapes dataset, where all three modules demonstrate significant performance gains.
arXiv Detail & Related papers (2020-12-19T21:18:03Z)
- Semantic-Guided Representation Enhancement for Self-supervised Monocular Trained Depth Estimation [39.845944724079814]
Self-supervised depth estimation has shown great effectiveness in producing high-quality depth maps given only image sequences as input.
However, its performance usually drops when estimating depth at border areas or on objects with thin structures, due to limited depth representation ability.
We propose a semantic-guided depth representation enhancement method, which promotes both local and global depth feature representations.
arXiv Detail & Related papers (2020-12-15T02:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.