Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
- URL: http://arxiv.org/abs/2602.19863v2
- Date: Tue, 24 Feb 2026 11:57:03 GMT
- Title: Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
- Authors: Filip Wolf, Blaž Rolih, Luka Čehovin Zajc
- Abstract summary: Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. We propose a dual-teacher contrastive distillation framework for multispectral imagery. Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Project page: https://wolfilip.github.io/DEO/.
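The abstract describes the objective at a high level but gives no implementation details. Below is a minimal PyTorch sketch of what a dual-teacher contrastive distillation loss could look like; the InfoNCE form, the per-teacher weights `w_ms` and `w_vfm`, and routing the RGB bands to the optical teacher are all assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def info_nce(student_feats: torch.Tensor, teacher_feats: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Contrastive (InfoNCE) alignment: matching batch indices are
    positives, all other student-teacher pairs are negatives."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    logits = s @ t.T / temperature                 # (B, B) cosine similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

def dual_teacher_loss(student, ms_teacher, vfm_teacher,
                      ms_batch, rgb_batch, w_ms=1.0, w_vfm=1.0):
    """Illustrative combination of two distillation signals: a frozen
    multispectral teacher and a frozen optical VFM teacher. The loss
    weights and input routing are hypothetical, not the paper's code."""
    z_ms = student(ms_batch)      # student embedding of multispectral input
    z_rgb = student(rgb_batch)    # student embedding of the optical (RGB) view
    with torch.no_grad():         # teachers only provide fixed targets
        t_ms = ms_teacher(ms_batch)
        t_rgb = vfm_teacher(rgb_batch)
    return w_ms * info_nce(z_ms, t_ms) + w_vfm * info_nce(z_rgb, t_rgb)
```

Keeping both teachers frozen means the second teacher costs only an extra forward pass per batch, which is one plausible reason distillation of this kind is cheaper than joint pretraining from scratch.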
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z)
- AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model [23.785186661138734]
We study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer.
arXiv Detail & Related papers (2025-12-23T08:37:11Z)
- WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation [4.654162664140336]
Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities. We propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures or pretrained weights. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels.
arXiv Detail & Related papers (2025-11-11T09:41:27Z)
- Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models [84.30626369903221]
We propose Dynamic Pattern Alignment Learning (DPAL) to efficiently train lightweight human-centric vision models. DPAL guides lightweight HVMs to learn all typical human visual patterns from large HVMs, which can generalize to various human-centric vision tasks. Extensive experiments on 15 challenging datasets demonstrate the effectiveness of DPAL.
arXiv Detail & Related papers (2025-08-10T02:27:06Z)
- TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation [65.74990259650984]
We introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
arXiv Detail & Related papers (2025-06-06T17:59:50Z)
- DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion [53.70278210626701]
We propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches.
arXiv Detail & Related papers (2025-05-08T17:59:47Z)
- DMT: Comprehensive Distillation with Multiple Self-supervised Teachers [27.037140667247208]
We introduce Comprehensive Distillation with Multiple Self-supervised Teachers (DMT) for pretrained model compression.
Our experimental results on prominent benchmark datasets exhibit that the proposed method significantly surpasses state-of-the-art competitors.
arXiv Detail & Related papers (2023-12-19T08:31:30Z)
- AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One [47.58919672657824]
We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One).
We develop a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models.
Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20k semantic segmentation, COCO object detection, and the LLaVA-1.5 framework.
arXiv Detail & Related papers (2023-12-10T17:07:29Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement [15.012694052674899]
We propose two novel ideas to improve self-supervised monocular depth estimation.
We use a parameter-optimized model, updated over the training epochs, as the teacher to provide additional supervision.
We leverage the contextual consistency between high-scale and low-scale features to obtain multiscale disparity offsets.
arXiv Detail & Related papers (2023-02-20T06:28:52Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
- Contrastive Multiview Coding with Electro-optics for SAR Semantic Segmentation [0.6445605125467573]
We propose multi-modal representation learning for SAR semantic segmentation.
Unlike previous studies, our method jointly uses EO imagery, SAR imagery, and a label mask.
Several experiments show that our approach is superior to the existing methods in model performance, sample efficiency, and convergence speed.
arXiv Detail & Related papers (2021-08-31T23:55:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.