HeightFormer: A Multilevel Interaction and Image-adaptive
Classification-regression Network for Monocular Height Estimation with Aerial
Images
- URL: http://arxiv.org/abs/2310.07995v1
- Date: Thu, 12 Oct 2023 02:49:00 GMT
- Title: HeightFormer: A Multilevel Interaction and Image-adaptive
Classification-regression Network for Monocular Height Estimation with Aerial
Images
- Authors: Zhan Chen and Yidan Zhang and Xiyu Qi and Yongqiang Mao and Xin Zhou
and Lulu Niu and Hui Wu and Lei Wang and Yunping Ge
- Abstract summary: This paper presents a comprehensive solution for monocular height estimation in remote sensing.
It features the Multilevel Interaction Backbone (MIB) and Image-adaptive Classification-regression Height Generator (ICG).
The ICG dynamically generates a height partition for each image and reframes the traditional regression task.
- Score: 10.716933766055755
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Height estimation has long been a pivotal topic within measurement and remote
sensing disciplines, proving critical for endeavours such as 3D urban
modelling, MR and autonomous driving. Traditional methods utilise stereo
matching or multisensor fusion, both well-established techniques that typically
necessitate multiple images from varying perspectives and adjunct sensors like
SAR, leading to substantial deployment costs. Single image height estimation
has emerged as an attractive alternative, boasting a larger data source variety
and simpler deployment. However, current methods suffer from limitations such
as fixed receptive fields and a lack of global information interaction, leading to
noticeable instance-level height deviations. The inherent complexity of height
prediction can result in a blurry estimation of object edge depth when using
mainstream regression methods based on fixed height division. This paper
presents a comprehensive solution for monocular height estimation in remote
sensing, termed HeightFormer, combining multilevel interactions and
image-adaptive classification-regression. It features the Multilevel
Interaction Backbone (MIB) and Image-adaptive Classification-regression Height
Generator (ICG). MIB supplements the fixed sample grid in CNN of the
conventional backbone network with tokens of different interaction ranges. It
is complemented by a pixel-, patch-, and feature map-level hierarchical
interaction mechanism, designed to relay spatial geometry information across
different scales and to introduce a global receptive field to enhance the
quality of instance-level height estimation. The ICG dynamically generates
a height partition for each image and reframes the traditional regression task,
using a refinement from coarse to fine classification-regression that
significantly mitigates the innate ill-posedness issue and drastically improves
edge sharpness.
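
As a rough illustration of the image-adaptive classification-regression idea described in the abstract, the sketch below shows one common way such a scheme can be set up: a per-image vector of width logits is normalised into a height partition, and each pixel's height is read out as the probability-weighted sum of the bin centers. This is not the authors' ICG implementation. The function names, the softmax normalisation, and the height range are all assumptions made for illustration, and the coarse-to-fine refinement mentioned in the abstract is not modelled.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_bin_centers(width_logits, h_min, h_max):
    """Hypothetical helper: turn per-image width logits into bin centers.

    The logits are normalised so the bin widths sum to the full height
    range [h_min, h_max]; each image can therefore get its own partition.
    """
    widths = softmax(width_logits)
    centers = []
    left_edge = h_min
    for w in widths:
        span = w * (h_max - h_min)
        centers.append(left_edge + span / 2.0)
        left_edge += span
    return centers


def pixel_height(class_logits, centers):
    """Hypothetical helper: classification-regression readout for one pixel.

    The pixel is classified over the bins, then the height is regressed
    as the expectation of the bin centers under those probabilities.
    """
    probs = softmax(class_logits)
    return sum(p * c for p, c in zip(probs, centers))


if __name__ == "__main__":
    # Assumed example: four bins over an assumed 0 to 40 metre range.
    centers = adaptive_bin_centers([0.0, 0.0, 0.0, 0.0], 0.0, 40.0)
    print(centers)
    print(pixel_height([0.0, 0.0, 0.0, 0.0], centers))
```

Because the output is a weighted sum over bin centers rather than a hard class index, the prediction stays continuous, while the per-image partition lets the bins concentrate where that image's heights actually fall.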
Related papers
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers well trained on massive images.
Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves the state-of-the-art performance of UDA for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z) - Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z) - HTC-DC Net: Monocular Height Estimation from Single Remote Sensing
Images [24.65766848068617]
We propose a method for monocular height estimation from optical imagery.
As an ill-posed problem, monocular height estimation requires well-designed networks for enhanced representations.
We propose HTC-DC Net following the classification-regression paradigm, with the head-tail cut (HTC) and the distribution-based constraints (DCs) as the main contributions.
arXiv Detail & Related papers (2023-09-28T14:50:32Z) - Global-to-Local Modeling for Video-based 3D Human Pose and Shape
Estimation [53.04781510348416]
Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z) - HiFuse: Hierarchical Multi-Scale Feature Fusion Network for Medical
Image Classification [16.455887856811465]
This paper proposes a three-branch hierarchical multi-scale feature fusion network structure termed as HiFuse for medical image classification.
The accuracy of our proposed model is 7.6% higher than the baseline on the ISIC dataset, 21.5% higher on the Covid-19 dataset, and 10.4% higher on the Kvasir dataset.
arXiv Detail & Related papers (2022-09-21T09:30:20Z) - Towards Model Generalization for Monocular 3D Object Detection [57.25828870799331]
We present an effective unified camera-generalized paradigm (CGP) for Mono3D object detection.
We also propose the 2D-3D geometry-consistent object scaling strategy (GCOS) to bridge the gap via an instance-level augment.
Our method called DGMono3D achieves remarkable performance on all evaluated datasets and surpasses the SoTA unsupervised domain adaptation scheme.
arXiv Detail & Related papers (2022-05-23T23:05:07Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for
Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z) - AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical one among which lies in foreground-background imbalance.
We propose Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ has significantly improved the accuracy on three widely used aerial benchmarks, while running as fast as the mainstream method.
arXiv Detail & Related papers (2022-02-18T10:14:45Z) - Disentangled Latent Transformer for Interpretable Monocular Height
Estimation [15.102260054654923]
We study how deep neural networks predict height from a single monocular image.
Our work provides novel insights for both understanding and designing MHE models.
arXiv Detail & Related papers (2022-01-17T11:42:30Z) - Height estimation from single aerial images using a deep ordinal
regression network [12.991266182762597]
We deal with the ambiguous and unsolved problem of height estimation from a single aerial image.
Driven by the success of deep learning, especially deep convolutional neural networks (CNNs), some studies have proposed to estimate height information from a single aerial image.
In this paper, we propose to divide height values into spacing-increasing intervals and transform the regression problem into an ordinal regression problem.
arXiv Detail & Related papers (2020-06-04T12:03:51Z)
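
The spacing-increasing intervals mentioned in the last entry above can be sketched as follows. One widely used form of spacing-increasing discretization places the bin edges uniformly in log space, so that bins widen as height grows; whether that paper uses exactly this form is an assumption here. The function names and the positive lower bound `lo` are hypothetical (heights of zero would need an offset before applying it).

```python
import math


def sid_edges(lo, hi, num_bins):
    """Hypothetical helper: num_bins + 1 edges, uniform in log space.

    Requires 0 < lo < hi. Consecutive edges share a constant ratio, so
    the interval widths grow with height.
    """
    if lo <= 0 or hi <= lo:
        raise ValueError("need 0 < lo < hi")
    log_ratio = math.log(hi / lo)
    return [math.exp(math.log(lo) + log_ratio * i / num_bins)
            for i in range(num_bins + 1)]


def ordinal_label(height, edges):
    """Hypothetical helper: ordinal class of a height value.

    Counts how many interior edges the height exceeds, which is the
    index of the interval it falls in.
    """
    return sum(1 for edge in edges[1:-1] if height > edge)


if __name__ == "__main__":
    # Assumed example range of 1 to 16 metres split into four intervals.
    edges = sid_edges(1.0, 16.0, 4)
    print(edges)
    print(ordinal_label(3.0, edges))
```

With these assumed values the interval widths double from one bin to the next, which gives fine resolution for low structures and coarser resolution for tall ones.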