LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer
- URL: http://arxiv.org/abs/2511.22812v1
- Date: Thu, 27 Nov 2025 23:56:35 GMT
- Title: LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer
- Authors: Kai Wang, Siyi Chen, Weicong Pang, Chenchen Zhang, Renjun Gao, Ziru Chen, Cheng Li, Dasa Gu, Rui Huang, Alexis Kai Hon Lau
- Abstract summary: LC4-DViT is a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions to synthesize high-fidelity training images. DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context.
- Score: 14.684808109822386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID) (Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River), DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen's Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT's attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
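The DViT design described in the abstract, a DCNv4 deformable convolutional backbone feeding a Vision Transformer encoder, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: DCNv4 is approximated here with torchvision's DeformConv2d, and the layer widths, encoder depth, and 224x224 input size are illustrative assumptions; only the eight-class output matches the AID setup reported above.

```python
# Minimal sketch (not the authors' code): a deformable-convolution stem feeding a
# ViT-style encoder for 8-class land-cover classification. DCNv4 is approximated
# with torchvision's DeformConv2d; all hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableStem(nn.Module):
    """One downsampling stage whose sampling grid can deform per pixel."""

    def __init__(self, in_ch, out_ch, k=3, stride=2):
        super().__init__()
        # A plain conv predicts 2 (x, y) offsets per kernel position.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, stride=stride, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, k, stride=stride, padding=k // 2)
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.norm(self.deform(x, self.offset(x))))


class DeformableViT(nn.Module):
    """Deformable features -> flattened tokens -> Transformer encoder -> class logits."""

    def __init__(self, num_classes=8, dim=256, depth=6, heads=8, img_size=224):
        super().__init__()
        self.stem = nn.Sequential(
            DeformableStem(3, 64), DeformableStem(64, 128), DeformableStem(128, dim)
        )
        n_tokens = (img_size // 8) ** 2  # three stride-2 stages -> 1/8 resolution
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, 3, 224, 224)
        f = self.stem(x)                         # (B, dim, 28, 28)
        tokens = f.flatten(2).transpose(1, 2)    # (B, 784, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tokens], dim=1) + self.pos)
        return self.head(z[:, 0])                # logits over the 8 AID classes


logits = DeformableViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 8])
```

The overall accuracy, macro F1-score, and Cohen's Kappa quoted above correspond to standard scikit-learn metrics (accuracy_score, f1_score with average="macro", and cohen_kappa_score) computed on the test-set predictions.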
Related papers
- StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation [18.410248448681514]
We propose StereoAdapter-2, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator. We construct UW-StereoDepth-80K, a large-scale synthetic underwater stereo dataset. Our framework achieves state-of-the-art zero-shot performance on underwater benchmarks.
arXiv Detail & Related papers (2026-02-18T22:12:08Z) - Pix2Geomodel: A Next-Generation Reservoir Geomodeling with Property-to-Property Translation [2.004012818482403]
This study introduces Pix2Geomodel, a novel conditional generative adversarial network (cGAN) framework based on Pix2Pix. It is designed to predict reservoir properties (facies, porosity, permeability, and water saturation) from the Rotliegend reservoir of the Groningen gas field. Results demonstrated high accuracy for facies (PA 0.88, FWIoU 0.85) and water saturation (PA 0.96, FWIoU 0.95), with moderate success for porosity (PA 0.70, FWIoU 0.55) and permeability (PA 0.74, FWIoU 0.60), and robust translation performance.
arXiv Detail & Related papers (2025-06-21T15:58:27Z) - Pan-Arctic Permafrost Landform and Human-built Infrastructure Feature Detection with Vision Transformers and Location Embeddings [1.2895931807247418]
Vision Transformers (ViTs) offer advantages in capturing long-range dependencies and global context via attention mechanisms. ViTs support pretraining via self-supervised learning, addressing the common limitation of labeled data in Arctic feature detection. This work investigates: (1) the suitability of pre-trained ViTs as feature extractors for high-resolution Arctic remote sensing tasks, and (2) the benefit of combining image and location embeddings.
arXiv Detail & Related papers (2025-06-03T13:34:01Z) - STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision [3.671692919685993]
We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective observations into global map perspective representations. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan.
arXiv Detail & Related papers (2025-03-11T00:38:54Z) - DeformUX-Net: Exploring a 3D Foundation Backbone for Medical Image Segmentation with Depthwise Deformable Convolution [26.746489317083352]
We introduce 3D DeformUX-Net, a pioneering volumetric CNN model.
We revisit volumetric deformable convolution in a depth-wise setting to adapt long-range dependencies with computational efficiency.
Our empirical evaluations reveal that the 3D DeformUX-Net consistently outperforms existing state-of-the-art ViTs and large kernel convolution models.
arXiv Detail & Related papers (2023-09-30T00:33:41Z) - Vision Transformers, a new approach for high-resolution and large-scale mapping of canopy heights [50.52704854147297]
We present a new vision transformer (ViT) model optimized with a classification (discrete) and a continuous loss function.
This model achieves better accuracy than previously used convolutional based approaches (ConvNets) optimized with only a continuous loss function.
arXiv Detail & Related papers (2023-04-22T22:39:03Z) - Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks, such as ImageNet-1K, show that our ASF-former outperforms its CNN and transformer counterparts, as well as hybrid pilots, in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [68.55487598401788]
Recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of 3D convolution and self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
arXiv Detail & Related papers (2022-01-12T20:02:32Z)