Atomizer: Generalizing to new modalities by breaking satellite images down to a set of scalars
- URL: http://arxiv.org/abs/2506.13542v2
- Date: Tue, 09 Sep 2025 09:27:04 GMT
- Title: Atomizer: Generalizing to new modalities by breaking satellite images down to a set of scalars
- Authors: Hugo Riffaud de Turckheim, Sylvain Lobry, Roberto Interdonato, Diego Marcos
- Abstract summary: Existing models rely on fixed input formats and modality-specific encoders, which require retraining when new configurations are introduced. We introduce Atomizer, a flexible architecture that represents remote sensing images as sets of tokens, each corresponding to a spectral band value of a pixel. Atomizer outperforms standard models and demonstrates robust performance across varying resolutions and spatial sizes.
- Score: 9.925465775310181
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing number of Earth observation satellites has led to increasingly diverse remote sensing data, with varying spatial, spectral, and temporal configurations. Most existing models rely on fixed input formats and modality-specific encoders, which require retraining when new configurations are introduced, limiting their ability to generalize across modalities. We introduce Atomizer, a flexible architecture that represents remote sensing images as sets of scalars, each corresponding to a spectral band value of a pixel. Each scalar is enriched with contextual metadata (acquisition time, spatial resolution, wavelength, and bandwidth), producing an atomic representation that allows a single encoder to process arbitrary modalities without interpolation or resampling. Atomizer uses structured tokenization with Fourier features and non-uniform radial basis functions to encode content and context, and maps tokens into a latent space via cross-attention. Under modality-disjoint evaluations, Atomizer outperforms standard models and demonstrates robust performance across varying resolutions and spatial sizes.
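To make the atomic representation concrete, below is a minimal PyTorch sketch of the idea described in the abstract: every (pixel, band) scalar becomes one token carrying its value plus contextual metadata, and a fixed set of latent vectors reads the variable-size token set through cross-attention. This is a sketch under assumptions, not the authors' implementation: the class and function names (`AtomicTokenizer`, `LatentCrossAttention`, `fourier_features`), dimensions, and metadata layout are illustrative, and the metadata is encoded with plain Fourier features only, whereas the paper also uses non-uniform radial basis functions.

```python
import torch
import torch.nn as nn

def fourier_features(x, num_bands=8, max_freq=10.0):
    """Encode scalars with sine/cosine Fourier features.
    The frequency schedule here is one common choice, not the paper's."""
    freqs = torch.linspace(1.0, max_freq, num_bands, device=x.device)
    angles = x.unsqueeze(-1) * freqs * torch.pi              # (..., num_bands)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (..., 2*num_bands)

class AtomicTokenizer(nn.Module):
    """Turn every (pixel, band) scalar into one token enriched with metadata."""
    def __init__(self, num_bands=8, token_dim=128):
        super().__init__()
        # value + 6 metadata scalars (row, col, wavelength, bandwidth,
        # resolution, acquisition time), each Fourier-encoded, then projected
        self.num_bands = num_bands
        self.proj = nn.Linear(7 * 2 * num_bands, token_dim)

    def forward(self, values, meta):
        # values: (B, N) scalars; meta: (B, N, 6) normalized per-scalar context
        feats = torch.cat([values.unsqueeze(-1), meta], dim=-1)  # (B, N, 7)
        enc = fourier_features(feats, self.num_bands)            # (B, N, 7, 2*num_bands)
        return self.proj(enc.flatten(-2))                        # (B, N, token_dim)

class LatentCrossAttention(nn.Module):
    """Perceiver-style read: a fixed set of latents attends over the atomic
    tokens, so the interface is the same for any number of input scalars."""
    def __init__(self, num_latents=64, token_dim=128, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, token_dim) * 0.02)
        self.attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, tokens):
        q = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return self.norm(out + q)                                # (B, num_latents, token_dim)

if __name__ == "__main__":
    B, H, W, C = 2, 8, 8, 4                    # a tiny 4-band image; any (H, W, C) works
    values = torch.rand(B, H * W * C)
    meta = torch.rand(B, H * W * C, 6)         # row, col, wavelength, bandwidth, resolution, time
    tokens = AtomicTokenizer()(values, meta)
    latents = LatentCrossAttention()(tokens)
    print(latents.shape)                       # torch.Size([2, 64, 128])
```

Because the latent array has a fixed size, the encoder's cost and output shape stay the same whether the input carries 3 bands or 13, at any spatial size, which is the property that lets a single encoder process arbitrary modalities without interpolation or resampling.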
Related papers
- Universal Pansharpening Foundation Model [67.10467574892282]
Pansharpening generates the high-resolution multi-spectral (MS) image by integrating spatial details from a texture-rich panchromatic (PAN) image and spectral attributes from a low-resolution MS image. We present FoundPS, a universal pansharpening foundation model for satellite-agnostic and scene-robust fusion.
arXiv Detail & Related papers (2026-03-04T08:30:15Z)
- SONIC: Spectral Oriented Neural Invariant Convolutions [0.0]
Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches. ViTs provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. We introduce SONIC, a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components.
arXiv Detail & Related papers (2026-01-27T18:51:11Z) - The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding [82.53463660564933]
semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders retain high-frequency information that conveys fine-grained detail.<n>We propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator.
arXiv Detail & Related papers (2025-12-22T18:59:57Z) - Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing [24.03278912134978]
We propose Any Optical Model (AOM) to accommodate arbitrary band compositions, sensor types, and resolution scales.<n>AOM consistently achieves state-of-the-art (SOTA) performance under challenging conditions such as band missing, cross sensor, and cross resolution settings.
arXiv Detail & Related papers (2025-12-19T04:21:01Z) - RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation [12.826798868837557]
RAMEN is a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data.<n>We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources.<n> RAMEN outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark.
arXiv Detail & Related papers (2025-12-04T17:40:17Z) - DisentangleFormer: Spatial-Channel Decoupling for Multi-Channel Vision [10.378378296066305]
Vision Transformers face a fundamental limitation: standard self-attention jointly processes spatial and channel dimensions.<n>We propose DisentangleFormer, an architecture that achieves robust multi-channel vision representation through principled spatial-channel decoupling.<n>Our design integrates three core components: (1) Parallel Disentanglement: Independently processes spatial-token and channel-token streams, enabling decorrelated feature learning across spatial and spectral dimensions, (2) Squeezed Token Enhancer: An adaptive calibration module that dynamically fuses spatial and channel streams, and (3) Multi-Scale FFN: complementing global attention with multi-scale local context.
arXiv Detail & Related papers (2025-12-03T23:03:56Z) - AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [58.67129770371016]
We propose a novel IRSTD framework that reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization.<n>AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy.
arXiv Detail & Related papers (2025-05-21T07:02:05Z) - Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction [20.1863553357121]
Current deep learning architectures for remote sensing are fundamentally rigid.<n>We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling.<n> STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands.<n>It unifies various dense prediction tasks and diverse semantic class predictions.
arXiv Detail & Related papers (2025-05-18T07:39:17Z) - CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis [75.25966323298003]
Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding.<n> variability in channel dimensionality and captured wavelengths among spectral cameras impede the development of AI-driven methodologies.<n>We introduce $textbfCARL$, a model for $textbfC$amera-$textbfA$gnostic $textbfR$esupervised $textbfL$ across RGB, multispectral, and hyperspectral imaging modalities.
arXiv Detail & Related papers (2025-04-27T13:06:40Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models.<n>We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components.<n>FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - Mixed-granularity Implicit Representation for Continuous Hyperspectral Compressive Reconstruction [16.975538181162616]
This study introduces a novel method using implicit neural representation for continuous hyperspectral image reconstruction.<n>By leveraging implicit neural representations, the MGIR framework enables reconstruction at any desired spatial-spectral resolution.
arXiv Detail & Related papers (2025-03-17T03:37:42Z) - Galileo: Learning Global & Local Features of Many Remote Sensing Modalities [34.71460539414284]
We present a novel self-supervised learning algorithm that extracts multi-scale features across a flexible set of input modalities through masked modeling.<n>Our Galileo is a single generalist model that outperforms SoTA specialist models for satellite images and pixel time series across eleven benchmarks and multiple tasks.
arXiv Detail & Related papers (2025-02-13T14:21:03Z) - CrossModalityDiffusion: Multi-Modal Novel View Synthesis with Unified Intermediate Representation [0.5242869847419834]
CrossModalityDiffusion is a modular framework designed to generate images across different modalities without prior knowledge of scene geometry.<n>We show that jointly training different modules ensures consistent geometric understanding across all modalities within the framework.<n>We validate CrossModalityDiffusion's capabilities on the synthetic ShapeNet cars dataset.
arXiv Detail & Related papers (2025-01-16T20:56:32Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Locality-Aware Generalizable Implicit Neural Representation [54.93702310461174]
Generalizable implicit neural representation (INR) enables a single continuous function to represent multiple data instances.
We propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder.
Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks.
arXiv Detail & Related papers (2023-10-09T11:26:58Z)
- Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration tasks.
We present a novel architecture with the goal of maintaining spatially precise, high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.