Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for
Urban-Scene Segmentation
- URL:
- Date: Fri, 25 Nov 2022 12:52:08 GMT
- Title: Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for
Urban-Scene Segmentation
- Authors: Liang Zeng, Attila Lengyel, Nergis Tömen, Jan van Gemert
- Abstract summary: We group coherent, semantically related pixels into coherent depth regions given their estimated depth.
Our method surpasses the previous state-of-the-art baseline by +7.14% in mIoU on Cityscapes and +6.65% on KITTI.
- Score: 14.742591401997048
- License:
- Abstract: In this work, we leverage estimated depth to boost self-supervised
contrastive learning for segmentation of urban scenes, where unlabeled videos
are readily available for training self-supervised depth estimation. We argue
that the semantics of a coherent group of pixels in 3D space is self-contained
and invariant to the contexts in which they appear. We group coherent,
semantically related pixels into coherent depth regions given their estimated
depth and use copy-paste to synthetically vary their contexts. In this way,
cross-context correspondences are built in contrastive learning and a
context-invariant representation is learned. For unsupervised semantic
segmentation of urban scenes, our method surpasses the previous
state-of-the-art baseline by +7.14% in mIoU on Cityscapes and +6.65% on KITTI.
For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive
with existing models, yet we do not need to pre-train on ImageNet or COCO, and
we are also more computationally efficient. Our code is available on
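The copy-paste idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: quantizing depth into equal-width bins is a simplifying assumption that stands in for the paper's grouping of coherent depth regions, and the names `depth_regions`, `copy_paste_region`, and `num_bins` are hypothetical.

```python
import numpy as np


def depth_regions(depth, num_bins=4):
    """Label each pixel with a depth-bin index in [0, num_bins - 1].

    Assumption: equal-width depth bins approximate "coherent depth regions".
    """
    edges = np.linspace(depth.min(), depth.max(), num_bins + 1)
    # Use interior edges only so the labels stay within [0, num_bins - 1].
    return np.digitize(depth, edges[1:-1])


def copy_paste_region(src_img, src_depth, dst_img, region_id, num_bins=4):
    """Paste one depth region of src_img into a copy of dst_img.

    Returns the pasted image and the boolean mask of the pasted pixels.
    """
    labels = depth_regions(src_depth, num_bins)
    mask = labels == region_id
    out = dst_img.copy()          # leave the destination image untouched
    out[mask] = src_img[mask]     # same pixels, new surrounding context
    return out, mask


# Toy example: depth increases from left to right, one depth value per column.
src = np.full((4, 4, 3), 255, dtype=np.uint8)
dst = np.zeros((4, 4, 3), dtype=np.uint8)
depth = np.tile(np.arange(4.0), (4, 1))
pasted, mask = copy_paste_region(src, depth, dst, region_id=1)
```

Under these assumptions, the region in `src` and its pasted copy in `pasted` show the same pixels in two different contexts, which is the kind of cross-context pair a contrastive loss could treat as a positive correspondence.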
Related papers
- S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving [12.406655155106424]
We propose S3PT, a novel scene semantics and structure guided clustering method that provides more scene-consistent objectives for self-supervised training.
Our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals.
Second, we introduce object diversity consistent spatial clustering, to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs.
Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level.
arXiv Detail & Related papers (2024-10-30T15:00:06Z)
- Transferring to Real-World Layouts: A Depth-aware Framework for Scene Adaptation [34.786268652516355]
Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from source synthetic data to real-world target data.
We propose a depth-aware framework to explicitly leverage depth estimation to mix the categories and facilitate the two complementary tasks, i.e., segmentation and depth learning.
In particular, the framework contains a Depth-guided Contextual Filter (DCF) for data augmentation and a cross-task encoder for contextual learning.
arXiv Detail & Related papers (2023-11-21T15:39:21Z)
- Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation [31.34615135846137]
We propose a few-shot-based method which learns to adapt the Vision-Language Models for monocular depth estimation.
Specifically, it assigns different depth bins for different scenes, which can be selected by the model during inference.
With only one image per scene for training, our extensive experimental results on the NYU V2 and KITTI datasets demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6% in terms of MARE.
arXiv Detail & Related papers (2023-11-02T06:56:50Z)
- X-Distill: Improving Self-Supervised Monocular Depth via Cross-Task Distillation [69.9604394044652]
We propose a novel method to improve the self-supervised training of monocular depth via cross-task knowledge distillation.
During training, we utilize a pretrained semantic segmentation teacher network and transfer its semantic knowledge to the depth network.
We extensively evaluate the efficacy of our proposed approach on the KITTI benchmark and compare it with the latest state of the art.
arXiv Detail & Related papers (2021-10-24T19:47:14Z)
- Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with Self-Supervised Depth Estimation [94.16816278191477]
We present a framework for semi-supervised and domain-adaptive semantic segmentation.
It is enhanced by self-supervised monocular depth estimation trained only on unlabeled image sequences.
We validate the proposed model on the Cityscapes dataset.
arXiv Detail & Related papers (2021-08-28T01:33:38Z)
- Learning Depth via Leveraging Semantics: Self-supervised Monocular Depth Estimation with Both Implicit and Explicit Semantic Guidance [34.62415122883441]
We propose a Semantic-aware Spatial Feature Alignment scheme to align implicit semantic features with depth features for scene-aware depth estimation.
We also propose a semantic-guided ranking loss to explicitly constrain the estimated depth maps to be consistent with real scene contextual properties.
Our method produces high quality depth maps which are consistently superior either on complex scenes or diverse semantic categories.
arXiv Detail & Related papers (2021-02-11T14:29:51Z)
- Exploring Cross-Image Pixel Contrast for Semantic Segmentation [130.22216825377618]
We propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting.
The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes.
Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing.
arXiv Detail & Related papers (2021-01-28T11:35:32Z)
- Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation [90.87105131054419]
We present a framework for semi-supervised semantic segmentation, which is enhanced by self-supervised monocular depth estimation from unlabeled image sequences.
We validate the proposed model on the Cityscapes dataset, where all three modules demonstrate significant performance gains.
arXiv Detail & Related papers (2020-12-19T21:18:03Z)
- Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
arXiv Detail & Related papers (2020-11-18T08:42:32Z)
- Context Prior for Scene Segmentation [118.46210049742993]
We develop a Context Prior with the supervision of the Affinity Loss.
The learned Context Prior extracts the pixels belonging to the same category, while the reversed prior focuses on the pixels of different classes.
Our algorithm achieves 46.3% mIoU on ADE20K, 53.9% mIoU on PASCAL-Context, and 81.3% mIoU on Cityscapes.
arXiv Detail & Related papers (2020-04-03T13:16:32Z)