Disentangled Latent Transformer for Interpretable Monocular Height Estimation
- URL: http://arxiv.org/abs/2201.06357v1
- Date: Mon, 17 Jan 2022 11:42:30 GMT
- Title: Disentangled Latent Transformer for Interpretable Monocular Height Estimation
- Authors: Zhitong Xiong, Sining Chen, Yilei Shi, and Xiao Xiang Zhu
- Abstract summary: We study how deep neural networks predict height from a single monocular image.
Our work provides novel insights for both understanding and designing MHE models.
- Score: 15.102260054654923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular height estimation (MHE) from remote sensing imagery has high
potential in generating 3D city models efficiently for a quick response to
natural disasters. Most existing works pursue higher performance. However,
there is little research exploring the interpretability of MHE networks. In
this paper, we aim to explore how deep neural networks predict height from
a single monocular image. Towards a comprehensive understanding of MHE
networks, we propose to interpret them from multiple levels: 1) Neurons:
unit-level dissection. Exploring the semantic and height selectivity of the
learned internal deep representations; 2) Instances: object-level
interpretation. Studying the effects of different semantic classes, scales, and
spatial contexts on height estimation; 3) Attribution: pixel-level analysis.
Understanding which input pixels are important for the height estimation. Based
on the multi-level interpretation, a disentangled latent Transformer network is
proposed to obtain a more compact, reliable, and explainable deep model for
monocular height estimation. Furthermore, a novel unsupervised semantic
segmentation task based on height estimation is first introduced in this work.
Additionally, we construct a new dataset for joint semantic segmentation
and height estimation. Our work provides novel insights for both understanding
and designing MHE models.
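The pixel-level attribution analysis described in the abstract can be illustrated with a minimal sketch: measure how much the predicted height changes when each input pixel is perturbed. The model below is a hypothetical stand-in for an MHE network, not the authors' architecture.

```python
import numpy as np

# Hypothetical stand-in for an MHE network: maps a flattened "image"
# of 16 pixels to a single height value. A real model would be a deep net.
rng = np.random.default_rng(0)
W = rng.normal(size=16)

def predict_height(x):
    return float(np.tanh(x @ W))

def saliency(x, eps=1e-4):
    """Pixel-level attribution via central finite differences:
    how sensitive is the predicted height to each input pixel?"""
    grads = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        grads[i] = (predict_height(xp) - predict_height(xm)) / (2 * eps)
    return np.abs(grads)  # magnitude = pixel importance

x = rng.normal(size=16)
s = saliency(x)  # one importance score per pixel
```

In practice, attribution for a deep network would use backpropagated gradients rather than finite differences, but the interpretation is the same: pixels with large scores drive the height prediction.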
Related papers
- Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
- Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer [12.486504395099022]
Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data.
Lack of labeled information poses a significant challenge to the model's representation, limiting its ability to capture the intricate details of the scene accurately.
We introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions.
arXiv Detail & Related papers (2024-06-13T08:51:57Z)
- OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation [56.028185293563325]
This paper studies a new open-set problem, the open-vocabulary category-level object pose and size estimation.
We first introduce OO3D-9D, a large-scale photorealistic dataset for this task.
We then propose a framework built on pre-trained DinoV2 and text-to-image stable diffusion models.
arXiv Detail & Related papers (2024-03-19T03:09:24Z) - Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation [20.230238670888454]
We introduce Marigold, a method for affine-invariant monocular depth estimation.
It can be fine-tuned in a couple of days on a single GPU using only synthetic training data.
It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases.
arXiv Detail & Related papers (2023-12-04T18:59:13Z)
- HeightFormer: A Multilevel Interaction and Image-adaptive Classification-regression Network for Monocular Height Estimation with Aerial Images [10.716933766055755]
This paper presents a comprehensive solution for monocular height estimation in remote sensing.
It features the Multilevel Interaction Backbone (MIB) and the Image-adaptive Classification-regression Height Generator (ICG).
The ICG dynamically generates a height partition for each image and reframes the traditional regression task.
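The classification-regression reframing can be sketched as follows: predict a probability distribution over height bins and take the probability-weighted sum of the bin centers. The bin centers and logits here are hypothetical fixed values; a real ICG would predict per-image bin partitions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def height_from_bins(logits, bin_centers):
    """Classification-regression head: expected height under a
    softmax distribution over height bins."""
    p = softmax(logits)
    return float(p @ bin_centers)

# Hypothetical per-image bin centers (meters) and mock network logits.
centers = np.array([0.0, 5.0, 10.0, 20.0, 40.0])
logits = np.array([0.1, 2.0, 0.5, -1.0, -2.0])
h = height_from_bins(logits, centers)  # a weighted average of the centers
```

Because the output is a convex combination of the bin centers, the prediction is always inside the valid height range, which is one motivation for this formulation over direct regression.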
arXiv Detail & Related papers (2023-10-12T02:49:00Z)
- Semi-Weakly Supervised Object Kinematic Motion Prediction [56.282759127180306]
Given a 3D object, kinematic motion prediction aims to identify the mobile parts as well as the corresponding motion parameters.
We propose a graph neural network to learn the map between hierarchical part-level segmentation and mobile parts parameters.
The network predictions yield a large scale of 3D objects with pseudo labeled mobility information.
arXiv Detail & Related papers (2023-03-31T02:37:36Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on an indoor NYUV2 dataset and an outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer for monocular depth estimation.
Our proposed method improves the state of the art by 3.3% and 3.3% on the two datasets, respectively, in terms of Root Mean Squared Error (RMSE).
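For reference, the RMSE metric used in these depth and height comparisons can be computed as below; the values are illustrative, not results from any of the listed papers.

```python
import numpy as np

def rmse(pred, target):
    """Root Mean Squared Error between predicted and ground-truth maps."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Toy example: one prediction is off by 2 meters, the rest are exact.
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # sqrt(4/3) ≈ 1.155
```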
arXiv Detail & Related papers (2022-07-10T20:49:11Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Improving Point Cloud Semantic Segmentation by Learning 3D Object Detection [102.62963605429508]
Point cloud semantic segmentation plays an essential role in autonomous driving.
Current 3D semantic segmentation networks focus on convolutional architectures that perform well for well-represented classes.
We propose a novel Aware 3D Semantic Detection (DASS) framework that explicitly leverages localization features from an auxiliary 3D object detection task.
arXiv Detail & Related papers (2020-09-22T14:17:40Z)
- Height estimation from single aerial images using a deep ordinal regression network [12.991266182762597]
We deal with the ambiguous and unsolved problem of height estimation from a single aerial image.
Driven by the success of deep learning, especially deep convolutional neural networks (CNNs), some studies have proposed to estimate height information from a single aerial image.
In this paper, we propose to divide height values into spacing-increasing intervals and transform the regression problem into an ordinal regression problem.
arXiv Detail & Related papers (2020-06-04T12:03:51Z)
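The spacing-increasing discretization behind this ordinal-regression formulation can be sketched with log-spaced interval edges, so that bins widen as height grows. This is a sketch in the spirit of the idea, with hypothetical range parameters, not the authors' exact scheme.

```python
import numpy as np

def sid_thresholds(alpha, beta, K):
    """Spacing-increasing discretization: K+1 log-spaced edges in
    [alpha, beta], giving narrow bins for low heights and wide bins
    for tall, rarer structures."""
    i = np.arange(K + 1)
    return np.exp(np.log(alpha) + np.log(beta / alpha) * i / K)

def height_to_ordinal(h, edges):
    """Map a continuous height to its interval index (the ordinal label)."""
    idx = np.searchsorted(edges, h, side="right") - 1
    return int(np.clip(idx, 0, len(edges) - 2))

# Hypothetical height range of 1 m to 100 m split into 10 ordinal bins.
edges = sid_thresholds(1.0, 100.0, K=10)
label = height_to_ordinal(30.0, edges)
```

A network then predicts the ordinal label (or, equivalently, a chain of binary "taller than edge k" decisions), and a height estimate is recovered from the selected interval.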
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.