ChromaFormer: A Scalable and Accurate Transformer Architecture for Land Cover Classification
- URL: http://arxiv.org/abs/2503.08534v1
- Date: Tue, 11 Mar 2025 15:24:50 GMT
- Title: ChromaFormer: A Scalable and Accurate Transformer Architecture for Land Cover Classification
- Authors: Mingshi Li, Dusan Grujicic, Ben Somers, Stien Heremans, Steven De Saeger, Matthew B. Blaschko
- Abstract summary: We propose a family of multi-spectral transformer models, which we evaluate across orders of magnitude differences in model parameters. We show that models many orders of magnitude larger than conventional architectures, such as UNet, lead to substantial accuracy improvements.
- Score: 11.348747673057405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Remote sensing imagery from systems such as Sentinel provides full coverage of the Earth's surface at around 10-meter resolution. The remote sensing community has transitioned to extensive use of deep learning models due to their high performance on benchmarks such as the UCMerced and ISPRS Vaihingen datasets. Convolutional models such as UNet and ResNet variations are commonly employed for remote sensing but typically only accept three channels, as they were developed for RGB imagery, while satellite systems provide more than ten. Recently, several transformer architectures have been proposed for remote sensing, but they have not been extensively benchmarked and are typically used on small datasets such as Salinas Valley. Meanwhile, it is becoming feasible to obtain dense spatial land-use labels for entire first-level administrative divisions of some countries. Scaling law observations suggest that substantially larger multi-spectral transformer models could provide a significant leap in remote sensing performance in these settings. In this work, we propose ChromaFormer, a family of multi-spectral transformer models, which we evaluate across orders of magnitude differences in model parameters to assess their performance and scaling effectiveness on a densely labeled imagery dataset of Flanders, Belgium, covering more than 13,500 km^2 and containing 15 classes. We propose a novel multi-spectral attention strategy and demonstrate its effectiveness through ablations. Furthermore, we show that models many orders of magnitude larger than conventional architectures, such as UNet, lead to substantial accuracy improvements: a UNet++ model with 23M parameters achieves less than 65% accuracy, while a multi-spectral transformer with 655M parameters achieves over 95% accuracy on the Biological Valuation Map of Flanders.
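The abstract does not detail the multi-spectral attention strategy itself. As a rough illustration of the general idea, the hypothetical PyTorch sketch below embeds each spectral band separately and lets self-attention mix information across bands and patches; all module names, dimensions, and design choices here are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultiSpectralAttention(nn.Module):
    """Hypothetical sketch: embed each spectral band separately, then let
    self-attention mix information across bands and patches. This illustrates
    the general idea only, not ChromaFormer's exact design."""
    def __init__(self, num_bands=13, patch=16, dim=256, heads=8):
        super().__init__()
        # One patch embedding per band (each band is a 1-channel image).
        self.embed = nn.ModuleList(
            nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
            for _ in range(num_bands)
        )
        self.band_token = nn.Parameter(torch.zeros(num_bands, dim))  # band identity
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                            # x: (B, num_bands, H, W)
        tokens = []
        for b, emb in enumerate(self.embed):
            t = emb(x[:, b:b + 1])                   # (B, dim, H/p, W/p)
            t = t.flatten(2).transpose(1, 2)         # (B, N, dim)
            tokens.append(t + self.band_token[b])    # tag tokens with band id
        seq = torch.cat(tokens, dim=1)               # (B, num_bands*N, dim)
        out, _ = self.attn(seq, seq, seq)            # cross-band, cross-patch mixing
        return self.norm(seq + out)

x = torch.randn(2, 13, 64, 64)                 # e.g. 13 Sentinel-2 bands
print(MultiSpectralAttention()(x).shape)       # torch.Size([2, 208, 256])
```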
Related papers
- Low-Level Matters: An Efficient Hybrid Architecture for Robust Multi-frame Infrared Small Target Detection [5.048364655933007]
Multi-frame infrared small target detection plays a crucial role in low-altitude and maritime surveillance. The hybrid architecture combining CNNs and Transformers shows great promise for enhancing multi-frame IRSTD. We propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks.
arXiv Detail & Related papers (2025-03-04T02:53:25Z)
- LapGSR: Laplacian Reconstructive Network for Guided Thermal Super-Resolution [1.747623282473278]
Fusing multiple modalities to produce high-resolution images often requires dense models with millions of parameters and a heavy computational load.
We propose LapGSR, a multimodal, lightweight, generative model incorporating Laplacian image pyramids for guided thermal super-resolution.
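Laplacian image pyramids themselves are a standard construction; below is a minimal sketch of the decomposition (and its exact inverse) that such a model builds on, using OpenCV. The RGB-guided fusion is the paper's contribution and is not reproduced here.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Standard Laplacian pyramid: each level stores the detail lost by one
    downsample/upsample round trip; the last entry is the coarse base."""
    pyramid, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyramid.append(cur - up)   # high-frequency residual at this scale
        cur = down
    pyramid.append(cur)            # low-frequency base
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition: upsample the base and add back residuals."""
    cur = pyramid[-1]
    for lap in reversed(pyramid[:-1]):
        cur = cv2.pyrUp(cur, dstsize=(lap.shape[1], lap.shape[0])) + lap
    return cur

img = np.random.rand(128, 128).astype(np.float32)  # stand-in thermal frame
assert np.allclose(reconstruct(laplacian_pyramid(img)), img, atol=1e-4)
```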
arXiv Detail & Related papers (2024-11-12T12:23:19Z)
- HorGait: A Hybrid Model for Accurate Gait Recognition in LiDAR Point Cloud Planar Projections [8.56443762544299]
HorGait is a hybrid model with a Transformer architecture for gait recognition on the planar projection of 3D point clouds from LiDAR.
It achieves state-of-the-art performance among Transformer architecture methods on the SUSTech1K dataset.
arXiv Detail & Related papers (2024-10-11T02:12:41Z)
- Improving satellite imagery segmentation using multiple Sentinel-2 revisits [0.0]
We explore the best way to use revisits in the framework of fine-tuning pre-trained remote sensing models.
We find that fusing representations from multiple revisits in the model latent space is superior to other methods of using revisits.
A SWIN Transformer-based architecture performs better than U-nets and ViT-based models.
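The summary names latent-space fusion without spelling it out. One common pattern is to encode each revisit with a shared backbone and average (or attend over) the resulting feature maps before the segmentation head; the PyTorch sketch below illustrates this under assumed shapes, with placeholder modules rather than the paper's pre-trained models.

```python
import torch
import torch.nn as nn

class RevisitFusion(nn.Module):
    """Sketch: encode each Sentinel-2 revisit with a shared backbone, then
    fuse the feature maps in latent space before the segmentation head.
    The encoder and head here are placeholders, not the paper's models."""
    def __init__(self, in_ch=3, feat=64, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(     # stand-in for a pretrained backbone
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(feat, num_classes, 1)

    def forward(self, revisits):                       # (B, T, C, H, W)
        B, T, C, H, W = revisits.shape
        feats = self.encoder(revisits.reshape(B * T, C, H, W))
        feats = feats.reshape(B, T, -1, H, W).mean(dim=1)  # latent-space fusion
        return self.head(feats)

x = torch.randn(2, 4, 3, 32, 32)   # 4 revisits of the same tile
print(RevisitFusion()(x).shape)    # torch.Size([2, 10, 32, 32])
```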
arXiv Detail & Related papers (2024-09-25T21:13:33Z)
- Energy-Based Models for Cross-Modal Localization using Convolutional Transformers [52.27061799824835]
We present a novel framework for localizing a ground vehicle mounted with a range sensor against satellite imagery in the absence of GPS.
We propose a method using convolutional transformers that performs accurate metric-level localization in a cross-modal manner.
We train our model end-to-end and demonstrate our approach achieving higher accuracy than the state-of-the-art on KITTI, Pandaset, and a custom dataset.
arXiv Detail & Related papers (2023-06-06T21:27:08Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
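The rotated varied-size window attention is that paper's contribution; for orientation, the sketch below shows plain fixed-window self-attention, the baseline such a mechanism generalizes by additionally learning each window's size and orientation. Window size and dimensions are illustrative.

```python
import torch
import torch.nn as nn

def window_attention(x, attn, win=8):
    """Plain fixed-window self-attention for reference; the paper's rotated
    varied-size windows additionally learn per-window size/orientation."""
    B, H, W, D = x.shape
    x = x.reshape(B, H // win, win, W // win, win, D)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, D)  # windows as batch
    out, _ = attn(x, x, x)                                     # attend within window
    out = out.reshape(B, H // win, W // win, win, win, D)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, D)

attn = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
x = torch.randn(1, 32, 32, 64)
print(window_attention(x, attn).shape)  # torch.Size([1, 32, 32, 64])
```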
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- Graph Neural Networks Extract High-Resolution Cultivated Land Maps from Sentinel-2 Image Series [33.10103896300028]
We introduce an approach for extracting 2.5 m cultivated land maps from 10 m Sentinel-2 multispectral image series.
The experiments indicate that our models not only outperform classical and deep machine learning techniques by delivering higher-quality segmentation maps, but also do so with a small memory footprint.
Such memory frugality is pivotal in missions where a model is uplinked to the AI-powered satellite once it is in orbit.
arXiv Detail & Related papers (2022-08-03T21:19:06Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
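As a hypothetical sketch of a spatial pyramid reduction module, the snippet below downsamples an image into tokens using parallel dilated convolutions at several rates to gather multi-scale context; the dilation rates and dimensions are illustrative assumptions, not ViTAE's exact configuration.

```python
import torch
import torch.nn as nn

class PyramidReduction(nn.Module):
    """Sketch of a spatial pyramid reduction module: parallel dilated
    convolutions capture multi-scale context while downsampling the image
    into tokens. Rates and dims are illustrative, not ViTAE's exact design."""
    def __init__(self, in_ch=3, dim=64, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=r, dilation=r)
            for r in rates
        )
        self.proj = nn.Conv2d(dim * len(rates), dim, 1)  # fuse the scales

    def forward(self, x):                       # (B, C, H, W)
        x = self.proj(torch.cat([b(x) for b in self.branches], dim=1))
        return x.flatten(2).transpose(1, 2)     # (B, N, dim) tokens

x = torch.randn(1, 3, 64, 64)
print(PyramidReduction()(x).shape)  # torch.Size([1, 1024, 64])
```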
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)
- DA-Transformer: Distance-aware Transformer [87.20061062572391]
We propose DA-Transformer, a distance-aware Transformer that can exploit the real distance.
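As a minimal illustration of the idea, the sketch below biases attention logits by the real distance between token positions; DA-Transformer learns a per-head re-scaling of these distances, which a single fixed slope stands in for here.

```python
import torch

def distance_aware_attention(q, k, v, alpha=0.1):
    """Sketch: penalize attention logits by the real (positional) distance
    between tokens. DA-Transformer learns a per-head re-scaling of these
    distances; a single fixed slope `alpha` stands in for it here."""
    B, N, D = q.shape
    pos = torch.arange(N, dtype=torch.float32)
    dist = (pos[None, :] - pos[:, None]).abs()           # (N, N) real distances
    logits = q @ k.transpose(-2, -1) / D ** 0.5 - alpha * dist
    return torch.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(2, 16, 32)
print(distance_aware_attention(q, k, v).shape)  # torch.Size([2, 16, 32])
```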
arXiv Detail & Related papers (2020-10-14T10:09:01Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scaled pedestrians.
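Center-and-scale prediction is a common anchor-free pattern; the sketch below shows what such a head can look like, with one channel scoring object centers and another regressing scale. Layer sizes are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CenterScaleHead(nn.Module):
    """Sketch of an anchor-free detection head: one channel scores pedestrian
    centers, another regresses object scale at each location. Layer sizes
    are illustrative placeholders."""
    def __init__(self, in_ch=128):
        super().__init__()
        self.center = nn.Conv2d(in_ch, 1, 1)   # center heatmap logits
        self.scale = nn.Conv2d(in_ch, 1, 1)    # log-scale regression

    def forward(self, feat):                   # fused multispectral features
        return torch.sigmoid(self.center(feat)), self.scale(feat)

feat = torch.randn(2, 128, 48, 64)
center, scale = CenterScaleHead()(feat)
print(center.shape, scale.shape)  # torch.Size([2, 1, 48, 64]) each
```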
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.