Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology
- URL: http://arxiv.org/abs/2510.22527v1
- Date: Sun, 26 Oct 2025 04:29:13 GMT
- Title: Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology
- Authors: Morgan Himes, Samiksha Krishnamurthy, Andrew Lizarraga, Srinath Saikrishnan, Vikram Seenivasan, Jonathan Soriano, Ying Nian Wu, Tuan Do
- Abstract summary: We build a dataset of 134,533 galaxy images (HSC-PDR2) and spectra (DESI-DR1). We adapt a Multi-Modal Masked Autoencoder to embed both images and spectra in a shared representation. We use this model to test three applications: spectral and image reconstruction from heavily masked data and redshift regression from images alone.
- Score: 29.09392720573202
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Upcoming surveys will produce billions of galaxy images but comparatively few spectra, motivating models that learn cross-modal representations. We build a dataset of 134,533 galaxy images (HSC-PDR2) and spectra (DESI-DR1) and adapt a Multi-Modal Masked Autoencoder (MMAE) to embed both images and spectra in a shared representation. The MMAE is a transformer-based architecture, which we train by masking 75% of the data and reconstructing missing image and spectral tokens. We use this model to test three applications: spectral and image reconstruction from heavily masked data and redshift regression from images alone. It recovers key physical features, such as galaxy shapes, atomic emission line peaks, and broad continuum slopes, though it struggles with fine image details and line strengths. For redshift regression, the MMAE performs comparably or better than prior multi-modal models in terms of prediction scatter even when missing spectra in testing. These results highlight both the potential and limitations of masked autoencoders in astrophysics and motivate extensions to additional modalities, such as text, for foundation models.
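The masking step described in the abstract (hide 75% of the joint image-plus-spectrum token sequence, reconstruct the hidden tokens) can be illustrated with a small sketch. This is not the authors' code: the patch size, token width, and the random-projection "embeddings" below are illustrative stand-ins for the MMAE's learned tokenizers, and no transformer or reconstruction loss is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize(image, patch):
    """Split a square image into flattened non-overlapping patches (tokens)."""
    h, w = image.shape
    return (image.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))

def mask_tokens(tokens, ratio=0.75, rng=rng):
    """Randomly hide `ratio` of the tokens, MAE-style; the encoder would
    see only the visible tokens, the decoder reconstructs the masked ones."""
    n = tokens.shape[0]
    n_keep = int(round(n * (1 - ratio)))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask_idx = np.sort(perm[n_keep:])
    return tokens[keep_idx], keep_idx, mask_idx

# Toy multi-modal input: a 32x32 "galaxy image" and a 256-bin "spectrum".
image = rng.normal(size=(32, 32))
spectrum = rng.normal(size=256)

img_tokens = tokenize(image, patch=8)    # 16 tokens of 64 pixels each
spec_tokens = spectrum.reshape(-1, 16)   # 16 tokens of 16 spectral bins each

# Each modality gets its own linear embedding into a shared width
# (a stand-in for the MMAE's learned patch/spectrum embeddings).
d = 32
img_emb = img_tokens @ rng.normal(size=(64, d)) / 8.0
spec_emb = spec_tokens @ rng.normal(size=(16, d)) / 4.0
tokens = np.concatenate([img_emb, spec_emb])   # one 32-token joint sequence

visible, keep_idx, mask_idx = mask_tokens(tokens, ratio=0.75)
print(tokens.shape, visible.shape, mask_idx.size)  # (32, 32) (8, 32) 24
```

Because the two modalities share one token sequence, a single random mask can hide an entire modality's tokens, which is what makes applications like redshift regression from images alone (spectrum fully masked at test time) natural for this architecture.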
Related papers
- SpectralAR: Spectral Autoregressive Visual Generation [74.48368364895387]
We propose a Spectral AutoRegressive visual generation framework, which realizes causality for visual sequences from the spectral perspective. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles.
arXiv Detail & Related papers (2025-06-12T17:57:44Z)
- HyperspectralMAE: The Hyperspectral Imagery Classification Model using Fourier-Encoded Dual-Branch Masked Autoencoder [0.04332259966721321]
Hyperspectral imagery provides rich spectral detail but poses unique challenges because of its high dimensionality in both the spatial and spectral domains. We propose HyperspectralMAE, a Transformer-based model for hyperspectral data that employs a dual masking strategy. HyperspectralMAE achieves state-of-the-art transfer-learning accuracy on Indian Pines, confirming that masked dual-dimensional pre-training yields robust spectral-spatial representations.
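One simple reading of a "dual masking strategy" is independent masking along the spatial and spectral axes of a hyperspectral token grid, so a token survives only if both its spatial position and its spectral band are kept. The sketch below shows that reading only; the paper's actual masking scheme, dual-branch design, and Fourier encoding are not reproduced here, and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def dual_mask(cube_tokens, spatial_ratio=0.5, spectral_ratio=0.5, rng=rng):
    """Mask spatial positions and spectral bands independently on a
    (n_positions, n_bands, dim) token grid; an entry is visible only if
    it survives along BOTH axes."""
    n_pos, n_band = cube_tokens.shape[:2]
    pos_keep = rng.random(n_pos) >= spatial_ratio    # per-position mask
    band_keep = rng.random(n_band) >= spectral_ratio # per-band mask
    return np.outer(pos_keep, band_keep)             # True where both survive

tokens = rng.normal(size=(64, 30, 8))  # 64 spatial patches x 30 bands x 8-dim
vis = dual_mask(tokens)
print(vis.shape)  # (64, 30)
```

Masking both dimensions forces the model to interpolate across space and across wavelength at once, which is the intuition behind calling the resulting representations spectral-spatial.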
arXiv Detail & Related papers (2025-05-09T01:16:42Z)
- SpectralGPT: Spectral Remote Sensing Foundation Model [60.023956954916414]
A universal RS foundation model, named SpectralGPT, is purpose-built to handle spectral RS images using a novel 3D generative pretrained transformer (GPT).
Compared to existing foundation models, SpectralGPT accommodates input images with varying sizes, resolutions, time series, and regions in a progressive training fashion, enabling full utilization of extensive RS big data.
Our evaluation highlights significant performance improvements with pretrained SpectralGPT models, signifying substantial potential in advancing spectral RS big data applications within the field of geoscience.
arXiv Detail & Related papers (2023-11-13T07:09:30Z)
- SSG2: A new modelling paradigm for semantic segmentation [0.0]
State-of-the-art models in semantic segmentation operate on single, static images, generating corresponding segmentation masks.
Inspired by work on semantic change detection, we introduce a methodology that leverages a sequence of observables generated for each static input image.
By adding this "temporal" dimension, we exploit strong signal correlations between successive observations in the sequence to reduce error rates.
We evaluate SSG2 across three diverse datasets: UrbanMonitor, featuring orthoimage tiles from Darwin, Australia with five spectral bands and 0.2 m spatial resolution; ISPRS Potsdam, which includes true orthophoto images with multiple spectral bands and a 5 cm ground sampling distance.
arXiv Detail & Related papers (2023-10-12T19:08:03Z)
- AstroCLIP: A Cross-Modal Foundation Model for Galaxies [40.43521617393482]
AstroCLIP embeds galaxy images and spectra separately by pretraining separate transformer-based image and spectrum encoders in self-supervised settings.
We find remarkable performance on all downstream tasks, even relative to supervised baselines.
Our approach represents the first cross-modal self-supervised model for galaxies, and the first self-supervised transformer-based architectures for galaxy images and spectra.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework built around a Masked Quantization VAE (MQ-VAE) that masks out unimportant image regions, freeing the model from modeling redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require a large number of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have been prevailing paradigms for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
- MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction [148.26195175240923]
We propose a novel Transformer-based method, Multi-stage Spectral-wise Transformer (MST++) for efficient spectral reconstruction.
In the NTIRE 2022 Spectral Reconstruction Challenge, our approach won first place.
arXiv Detail & Related papers (2022-04-17T02:39:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.