CROMA: Remote Sensing Representations with Contrastive Radar-Optical
Masked Autoencoders
- URL: http://arxiv.org/abs/2311.00566v1
- Date: Wed, 1 Nov 2023 15:07:27 GMT
- Title: CROMA: Remote Sensing Representations with Contrastive Radar-Optical
Masked Autoencoders
- Authors: Anthony Fuller, Koreen Millard, James R. Green
- Abstract summary: Remote sensing offers vast yet sparsely labeled, spatially aligned multimodal data.
We present CROMA, a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations.
- Score: 2.7624021966289605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A vital and rapidly growing application, remote sensing offers vast yet
sparsely labeled, spatially aligned multimodal data; this makes self-supervised
learning algorithms invaluable. We present CROMA: a framework that combines
contrastive and reconstruction self-supervised objectives to learn rich
unimodal and multimodal representations. Our method separately encodes
masked-out multispectral optical and synthetic aperture radar samples --
aligned in space and time -- and performs cross-modal contrastive learning.
Another encoder fuses these sensors, producing joint multimodal encodings that
are used to predict the masked patches via a lightweight decoder. We show that
these objectives are complementary when leveraged on spatially aligned
multimodal data. We also introduce X- and 2D-ALiBi, which spatially biases our
cross- and self-attention matrices. These strategies improve representations
and allow our models to effectively extrapolate to images up to 17.6x larger at
test-time. CROMA outperforms the current SoTA multispectral model, evaluated
on: four classification benchmarks -- finetuning (avg. 1.8%), linear (avg.
2.4%) and nonlinear (avg. 1.4%) probing, kNN classification (avg. 3.5%), and
K-means clustering (avg. 8.4%); and three segmentation benchmarks (avg. 6.4%).
CROMA's rich, optionally multimodal representations can be widely leveraged
across remote sensing applications.
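To make the abstract's two key mechanisms concrete, the short sketch below illustrates (a) a 2D ALiBi-style bias that penalizes attention logits by the Euclidean distance between patch positions, and (b) a training loss that adds a symmetric cross-modal InfoNCE term to a masked-patch reconstruction term. This is a minimal illustration assuming a PyTorch setting, not the authors' implementation; the slope schedule, temperature, equal loss weighting, and function names are assumptions introduced here for exposition.

```python
# Minimal sketch (PyTorch) of a 2D ALiBi-style attention bias and a combined
# contrastive + masked-reconstruction objective. Illustrative only; not the
# CROMA reference implementation.
import torch
import torch.nn.functional as F


def alibi_2d_bias(grid_h, grid_w, num_heads):
    """Per-head additive bias: -slope * Euclidean distance between patch positions."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = torch.cdist(pos, pos)                                     # (N, N)
    # Geometric slope schedule per head, following the 1D ALiBi recipe (assumption).
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)])
    return -slopes.view(num_heads, 1, 1) * dist                      # (heads, N, N)


def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with the spatial bias added to the logits."""
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + bias
    return torch.softmax(logits, dim=-1) @ v


def contrastive_plus_reconstruction(opt_emb, sar_emb, recon, target, temperature=0.07):
    """Symmetric cross-modal InfoNCE on pooled unimodal embeddings plus an MSE
    reconstruction loss on masked patches (equal weighting is an assumption)."""
    z_o = F.normalize(opt_emb, dim=-1)
    z_s = F.normalize(sar_emb, dim=-1)
    logits = z_o @ z_s.t() / temperature
    labels = torch.arange(z_o.shape[0])
    contrastive = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    reconstruction = F.mse_loss(recon, target)
    return contrastive + reconstruction
```

Because the bias depends only on patch coordinates, it can simply be recomputed for a larger grid at test time (e.g., `alibi_2d_bias(2 * h, 2 * w, num_heads)`), which is the property that supports extrapolating to images larger than those seen in pre-training.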
Related papers
- LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks [27.57718303520023]
We propose a Lightweight Adaptive Cue-Aware Vision Mamba network.
It efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios.
Our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters.
arXiv Detail & Related papers (2025-07-30T08:28:20Z) - MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies [12.485905108032146]
This paper introduces MetaOcc, a novel multi-modal framework for omni-oriented 3D occupancy prediction.
To address the limitations of directly applying encoders to sparse radar data, we propose a Radar Height Self-Attention module.
To reduce reliance on expensive point clouds, we propose a pseudo-label generation pipeline based on an open-set segmentor.
arXiv Detail & Related papers (2025-01-26T03:51:56Z) - LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z) - GSPR: Multimodal Place Recognition Using 3D Gaussian Splatting for Autonomous Driving [9.023864430027333]
Multimodal place recognition has gained increasing attention due to its ability to overcome the weaknesses of unimodal sensor systems.
We propose a 3D Gaussian-based multimodal place recognition neural network dubbed GSPR.
arXiv Detail & Related papers (2024-10-01T00:43:45Z) - 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z) - Multiway Point Cloud Mosaicking with Diffusion and Global Optimization [74.3802812773891]
We introduce a novel framework for multiway point cloud mosaicking (named Wednesday).
At the core of our approach is ODIN, a learned pairwise registration algorithm that identifies overlaps and refines attention scores.
Tested on four diverse, large-scale datasets, our method achieves state-of-the-art pairwise and rotation registration results by a large margin on all benchmarks.
arXiv Detail & Related papers (2024-03-30T17:29:13Z) - Fus-MAE: A cross-attention-based data fusion approach for Masked Autoencoders in remote sensing [5.070981175240306]
Fus-MAE is a self-supervised learning framework based on masked autoencoders.
Our empirical findings demonstrate that Fus-MAE can effectively compete with contrastive learning strategies tailored for SAR-optical data fusion.
arXiv Detail & Related papers (2024-01-05T11:36:21Z) - KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection [48.66703222700795]
We resort to a novel kernel strategy to identify the most informative point clouds to acquire labels.
To accommodate both one-stage (i.e., SECOND) and two-stage detectors, we incorporate the classification entropy tangent and trade off detection performance against the total number of bounding boxes selected for annotation.
Our results show that approximately 44% box-level annotation costs and 26% computational time are reduced compared to the state-of-the-art method.
arXiv Detail & Related papers (2023-07-16T04:27:03Z) - PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D
Object Detection [26.03582038710992]
Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities.
In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world.
We propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects.
arXiv Detail & Related papers (2023-03-14T17:58:03Z) - GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds [72.60362979456035]
Masked Autoencoders (MAE) remain challenging to apply to large-scale 3D point clouds.
We propose a Generative Decoder for MAE (GD-MAE) to automatically merge the surrounding context.
We demonstrate the efficacy of the proposed method on several large-scale benchmarks, including KITTI and ONCE.
arXiv Detail & Related papers (2022-12-06T14:32:55Z) - Shared Manifold Learning Using a Triplet Network for Multiple Sensor
Translation and Fusion with Missing Data [2.452410403088629]
We propose a Contrastive learning based MultiModal Alignment Network (CoMMANet) to align data from different sensors into a shared and discriminative manifold.
The proposed architecture uses a multimodal triplet autoencoder to cluster the latent space in such a way that samples of the same classes from each heterogeneous modality are mapped close to each other.
arXiv Detail & Related papers (2022-10-25T20:22:09Z) - FusionRCNN: LiDAR-Camera Fusion for Two-stage 3D Object Detection [11.962073589763676]
Existing 3D detectors significantly improve the accuracy by adopting a two-stage paradigm.
The sparsity of point clouds, especially for the points far away, makes it difficult for the LiDAR-only refinement module to accurately recognize and locate objects.
We propose a novel multi-modality two-stage approach named FusionRCNN, which effectively and efficiently fuses point clouds and camera images in the Regions of Interest (RoI).
FusionRCNN significantly improves the strong SECOND baseline by 6.14% mAP and outperforms competing two-stage approaches.
arXiv Detail & Related papers (2022-09-22T02:07:25Z) - Inertial Hallucinations -- When Wearable Inertial Devices Start Seeing
Things [82.15959827765325]
We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL).
We address two major shortcomings of standard multimodal approaches, limited area coverage and reduced reliability.
Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalities to handle missing sensors at inference time.
arXiv Detail & Related papers (2022-07-14T10:04:18Z) - Multimodal Object Detection via Bayesian Fusion [59.31437166291557]
We study multimodal object detection with RGB and thermal cameras, since the latter can provide much stronger object signatures under poor illumination.
Our key contribution is a non-learned late-fusion method that fuses together bounding box detections from different modalities.
We apply our approach to benchmarks containing both aligned (KAIST) and unaligned (FLIR) multimodal sensor data.
arXiv Detail & Related papers (2021-04-07T04:03:20Z) - Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scaled pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)