ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient
Self-Supervised Monocular Depth Estimation
- URL: http://arxiv.org/abs/2212.05729v1
- Date: Mon, 12 Dec 2022 06:38:35 GMT
- Title: ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient
Self-Supervised Monocular Depth Estimation
- Authors: Daitao Xing, Jinglin Shen, Chiuman Ho and Anthony Tzes
- Abstract summary: We propose an efficient local adaptive attention method for geometry-aware representation enhancement.
We leverage geometric cues from semantic information to learn local adaptive bounding boxes that guide unsupervised feature aggregation.
Our proposed method establishes a new state of the art on the self-supervised monocular depth estimation task.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploiting mutually beneficial cross-domain information has shown great
potential toward accurate self-supervised depth estimation. In this work, we revisit
feature fusion between depth and semantic information and propose an efficient
local adaptive attention method for geometry-aware representation enhancement.
Instead of building global connections or deforming attention across the
feature space without restraint, we bound the spatial interaction within a
learnable region of interest. In particular, we leverage geometric cues from
semantic information to learn local adaptive bounding boxes to guide
unsupervised feature aggregation. The local areas exclude most irrelevant
reference points from the attention space, yielding more selective feature learning
and faster convergence. We naturally extend the paradigm in a multi-head and
hierarchical manner to enable information distillation at different semantic
levels and improve feature discriminability for fine-grained depth
estimation. Extensive experiments on the KITTI dataset show that our proposed
method establishes a new state of the art on the self-supervised monocular depth
estimation task, demonstrating the effectiveness of our approach over former
Transformer variants.
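The idea of bounding attention to a region of interest can be illustrated with a small sketch. This is not the authors' implementation: the function name `roi_attention`, the box format (top, left, bottom, right) and the use of a fixed, hand-supplied box are all assumptions made for illustration. In the paper the boxes are learned from semantic cues, which this sketch does not attempt.

```python
import math


def roi_attention(query, keys, values, box):
    """Attend from one query vector to the cells inside `box` only.

    query:  list of d floats
    keys:   H x W grid (nested lists) of length-d vectors
    values: H x W grid (nested lists) of length-c vectors
    box:    (top, left, bottom, right), inclusive, in feature-map cells
            (hypothetical format; here it is given, not learned)
    """
    height, width = len(keys), len(keys[0])
    top, left, bottom, right = box
    # Clip the box to the feature map so partly out-of-range boxes stay valid.
    top, left = max(0, top), max(0, left)
    bottom, right = min(height - 1, bottom), min(width - 1, right)
    if top > bottom or left > right:
        raise ValueError("box does not overlap the feature map")

    cells = [(y, x) for y in range(top, bottom + 1)
             for x in range(left, right + 1)]
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[y][x]))
              for y, x in cells]

    # Numerically stable softmax over the cells inside the box; everything
    # outside the box never enters the attention space.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    channels = len(values[0][0])
    out = [0.0] * channels
    for weight, (y, x) in zip(weights, cells):
        for c in range(channels):
            out[c] += weight * values[y][x][c]
    return out, dict(zip(cells, weights))
```

Compared with global attention over all H x W positions, the cost per query here scales with the box area, which is one plausible reading of why restricting the reference points makes aggregation more selective and cheaper.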
Related papers
- Unified Domain Adaptive Semantic Segmentation [96.74199626935294]
Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer the supervision from a labeled source domain to an unlabeled target domain.
We propose a Quad-directional Mixup (QuadMix) method, characterized by tackling distinct point attributes and feature inconsistencies.
Our method outperforms the state-of-the-art works by large margins on four challenging UDA-SS benchmarks.
arXiv Detail & Related papers (2023-11-22T09:18:49Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Multi-Frame Self-Supervised Depth with Transformers [33.00363651105475]
We propose a novel transformer architecture for cost volume generation.
We use depth-discretized epipolar sampling to select matching candidates.
We refine predictions through a series of self- and cross-attention layers.
arXiv Detail & Related papers (2022-04-15T19:04:57Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for
Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Point-Level Region Contrast for Object Detection Pre-Training [147.47349344401806]
We present point-level region contrast, a self-supervised pre-training approach for the task of object detection.
Our approach performs contrastive learning by directly sampling individual point pairs from different regions.
Compared to an aggregated representation per region, our approach is more robust to the change in input region quality.
arXiv Detail & Related papers (2022-02-09T18:56:41Z)
- Fine-grained Semantics-aware Representation Enhancement for
Self-supervised Monocular Depth Estimation [16.092527463250708]
We propose novel ideas to improve self-supervised monocular depth estimation.
We focus on incorporating implicit semantic knowledge into geometric representation enhancement.
We evaluate our methods on the KITTI dataset and demonstrate that our method outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-08-19T17:50:51Z)
- Oriented RepPoints for Aerial Object Detection [10.818838437018682]
In this paper, we propose a novel approach to aerial object detection, named Oriented RepPoints.
Specifically, we propose employing a set of adaptive points to capture the geometric and spatial information of arbitrarily oriented objects.
To facilitate supervised learning, an oriented conversion function is proposed to explicitly map the adaptive point set into an oriented bounding box.
arXiv Detail & Related papers (2021-05-24T06:18:23Z)
- Video Salient Object Detection via Adaptive Local-Global Refinement [7.723369608197167]
Video salient object detection (VSOD) is an important task in many vision applications.
We propose an adaptive local-global refinement framework for VSOD.
We show that our weighting methodology can further exploit the feature correlations, thus driving the network to learn more discriminative feature representation.
arXiv Detail & Related papers (2021-04-29T14:14:11Z)
- Domain Adaptive Semantic Segmentation with Self-Supervised Depth
Estimation [84.34227665232281]
Domain adaptation for semantic segmentation aims to improve the model performance in the presence of a distribution shift between source and target domain.
We leverage the guidance from self-supervised depth estimation, which is available on both domains, to bridge the domain gap.
We demonstrate the effectiveness of our proposed approach on the benchmark tasks SYNTHIA-to-Cityscapes and GTA-to-Cityscapes.
arXiv Detail & Related papers (2021-04-28T07:47:36Z)
- Semantic-Guided Representation Enhancement for Self-supervised Monocular
Trained Depth Estimation [39.845944724079814]
Self-supervised depth estimation has proven highly effective at producing high-quality depth maps given only image sequences as input.
However, its performance usually drops in border areas and on objects with thin structures, owing to its limited depth representation ability.
We propose a semantic-guided depth representation enhancement method, which promotes both local and global depth feature representations.
arXiv Detail & Related papers (2020-12-15T02:24:57Z)
- Spatial Attention Pyramid Network for Unsupervised Domain Adaptation [66.75008386980869]
Unsupervised domain adaptation is critical in various computer vision tasks.
We design a new spatial attention pyramid network for unsupervised domain adaptation.
Our method performs favorably against the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2020-03-29T09:03:23Z)