EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
- URL: http://arxiv.org/abs/2602.12177v1
- Date: Thu, 12 Feb 2026 17:09:14 GMT
- Title: EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
- Authors: Nils Lehmann, Yi Wang, Zhitong Xiong, Xiaoxiang Zhu
- Abstract summary: State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the Earth observation domain.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.
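The core mechanism the abstract describes, a single encoder whose input projection is generated per channel by a hypernetwork conditioned on channel metadata, can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the dimensions, the one-layer hypernetwork, and the sinusoidal wavelength embedding are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
D_COND = 8     # size of the per-channel wavelength embedding
D_LATENT = 16  # size of the shared latent representation

# Hypernetwork parameters: a one-layer linear map from a channel's
# wavelength embedding to that channel's projection weights.
W_hyper = rng.normal(scale=0.1, size=(D_COND, D_LATENT))

def wavelength_embedding(wavelength_nm: float) -> np.ndarray:
    """Sinusoidal embedding of a channel's central wavelength."""
    freqs = np.arange(1, D_COND // 2 + 1)
    x = wavelength_nm / 1000.0
    return np.concatenate([np.sin(freqs * x), np.cos(freqs * x)])

def encode(pixels: np.ndarray, wavelengths_nm: list) -> np.ndarray:
    """Project an arbitrary set of spectral channels into a fixed-size latent.

    pixels: (C, N) array, one row of N pixel values per channel.
    wavelengths_nm: central wavelength of each of the C channels.
    """
    # Generate a (C, D_LATENT) projection matrix on the fly: one row of
    # weights per input channel, produced by the hypernetwork.
    proj = np.stack([wavelength_embedding(w) @ W_hyper for w in wavelengths_nm])
    # Summing over channels makes the latent size independent of C.
    return pixels.T @ proj  # shape (N, D_LATENT)

# Two sensors with different channel counts share one encoder.
s2 = encode(rng.normal(size=(4, 5)), [490.0, 560.0, 665.0, 842.0])  # 4 bands
s1 = encode(rng.normal(size=(2, 5)), [5405.0, 5405.0])              # 2 bands
assert s2.shape == s1.shape == (5, D_LATENT)
```

Because the projection weights are a function of channel metadata rather than fixed parameters, any subset or combination of channels maps into the same latent space, which is what lets one tokenizer serve multiple sensors.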
Related papers
- Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. We propose a dual-teacher contrastive distillation framework for multispectral imagery. Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning.
arXiv Detail & Related papers (2026-02-23T14:09:01Z)
- Future Optical Flow Prediction Improves Robot Control & Video Generation
We introduce FOFPred, a novel optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred.
arXiv Detail & Related papers (2026-01-15T18:49:48Z)
- OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving
We propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction.
arXiv Detail & Related papers (2025-12-16T09:18:15Z)
- Latent Dirichlet Transformer VAE for Hyperspectral Unmixing with Bundled Endmembers
We propose the Latent Dirichlet Transformer Variational Autoencoder (LDVAE-T) for hyperspectral unmixing. Our model combines the global context modeling capabilities of transformer architectures with physically meaningful constraints imposed by a Dirichlet prior in the latent space. We evaluate our approach on three benchmark datasets: Samson, Jasper Ridge, and HYDICE Urban.
arXiv Detail & Related papers (2025-11-21T20:15:37Z)
- HyperAIRI: a plug-and-play algorithm for precise hyperspectral image reconstruction in radio interferometry
We introduce HyperAIRI, a hyperspectral extension of AIRI, underpinned by learned hyperspectral denoisers enforcing a power-law spectral model. For each spectral channel, the HyperAIRI denoiser takes as input its current image estimate, alongside estimates of its two immediate neighbouring channels and the spectral index map. To accommodate varying dynamic ranges, we assemble a shelf of pre-trained denoisers, each tailored to a specific dynamic range.
arXiv Detail & Related papers (2025-10-16T23:49:20Z)
- EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM
Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs.
arXiv Detail & Related papers (2025-06-02T13:36:05Z)
- Exploring Representation-Aligned Latent Space for Better Generation
We introduce ReaLS, which integrates semantic priors to improve generation performance. We show that fundamental DiT and SiT trained on ReaLS can achieve a 15% improvement in the FID metric. The enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z)
- Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation
We propose a unified, multimodal foundation framework designed for diverse vision tasks in Earth observation (EO). Inspired by neural plasticity, DOFA utilizes a wavelength-conditioned dynamic hypernetwork to process inputs from five distinct satellite sensors flexibly. We show DOFA's potential as a foundation for general-purpose vision models in the sensor-diverse EO domain.
arXiv Detail & Related papers (2024-03-22T17:11:47Z)
- DiffiT: Diffusion Vision Transformers for Image Generation
The Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model, Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.