Related papers: MARMOT: Masked Autoencoder for Modeling Transient Imaging

URL: http://arxiv.org/abs/2506.08470v1
Date: Tue, 10 Jun 2025 05:49:22 GMT
Title: MARMOT: Masked Autoencoder for Modeling Transient Imaging
Authors: Siyuan Shen, Ziheng Wang, Xingyue Peng, Suan Xia, Ruiqian Li, Shiying Li, Jingyi Yu
Abstract summary: We present a masked autoencoder for modeling transient imaging, or MARMOT, to facilitate non-line-of-sight (NLOS) applications. MARMOT is a self-supervised model pretrained on massive and diverse NLOS transient datasets.
Score: 30.865812827455326
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Pretrained models have demonstrated impressive success in many modalities such as language and vision. Recent works bring the pretraining paradigm to imaging research. Transients are a novel modality, captured for an object as photon counts versus arrival times using a precisely time-resolved sensor. In non-line-of-sight (NLOS) scenarios in particular, transients of hidden objects are measured beyond the sensor's direct line of sight. Using NLOS transients, the majority of previous works optimize volume density or surfaces to reconstruct the hidden objects and do not transfer priors learned from datasets. In this work, we present a masked autoencoder for modeling transient imaging, or MARMOT, to facilitate NLOS applications. MARMOT is a self-supervised model pretrained on massive and diverse NLOS transient datasets. Using a Transformer-based encoder-decoder, MARMOT learns features from partially masked transients via a scanning pattern mask (SPM), where the unmasked subset is functionally equivalent to arbitrary sampling, and predicts full measurements. Pretrained on TransVerse, a synthesized transient dataset of 500K 3D models, MARMOT adapts to downstream imaging tasks using direct feature transfer or decoder finetuning. Comprehensive experiments are carried out in comparison with state-of-the-art methods. Quantitative and qualitative results demonstrate the efficiency of MARMOT.
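
Illustration (not from the paper): the abstract does not say how the scanning pattern mask is constructed, so the sketch below only shows the general masked-autoencoder idea it describes, under loud assumptions. It assumes a regular grid of scan points, one photon-count histogram per point, and a uniformly random choice of visible points. The names scanning_pattern_mask, split_transients, and keep_ratio are hypothetical and are not MARMOT's API.

```python
import random


def scanning_pattern_mask(height, width, keep_ratio=0.25, seed=0):
    """Return a height x width grid of booleans; True marks a visible scan point.

    Assumption: visible points are drawn uniformly at random. The paper's
    actual SPM may follow a different pattern.
    """
    rng = random.Random(seed)
    total = height * width
    n_keep = max(1, round(total * keep_ratio))
    visible = set(rng.sample(range(total), n_keep))
    return [[(r * width + c) in visible for c in range(width)]
            for r in range(height)]


def split_transients(transients, mask):
    """Split per-point histograms into visible ones (encoder input)
    and masked ones (reconstruction targets)."""
    visible, masked = [], []
    for r, row in enumerate(mask):
        for c, keep in enumerate(row):
            target = visible if keep else masked
            target.append(((r, c), transients[r][c]))
    return visible, masked


# Toy data: a 4 x 4 scan grid with 8 time bins per scan point.
H, W, T = 4, 4, 8
transients = [[[(r + c + t) % 3 for t in range(T)] for c in range(W)]
              for r in range(H)]

mask = scanning_pattern_mask(H, W, keep_ratio=0.25)
visible, masked = split_transients(transients, mask)
print(len(visible), len(masked))  # 4 12
```

In a masked autoencoder, only the visible histograms would be fed to the encoder, and the decoder would be trained to predict the masked ones; the model itself is omitted here.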

Related papers

$γ$-Quant: Towards Learnable Quantization for Low-bit Pattern Recognition [28.31494154592102]
We propose $γ$-Quant, i.e., the task-specific learning of a non-linear quantization for pattern recognition. We demonstrate that raw data with a learnable quantization using as few as 4 bits can perform on par with the use of raw 12-bit data.
arXiv Detail & Related papers (2025-09-26T15:03:55Z)
Time Step Generating: A Universal Synthesized Deepfake Image Detector [0.4488895231267077]
We propose a universal synthetic image detector, Time Step Generating (TSG). TSG does not rely on pre-trained models' reconstructing ability, specific datasets, or sampling algorithms. We test the proposed TSG on the large-scale GenImage benchmark, and it achieves significant improvements in both accuracy and generalizability.
arXiv Detail & Related papers (2024-11-17T09:39:50Z)
MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks. Transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks. We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection. Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data. In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z)
Morphing Tokens Draw Strong Masked Image Models [28.356863521946607]
Masked image modeling (MIM) has emerged as a promising approach for pre-training Vision Transformers (ViTs). We introduce Dynamic Token Morphing (DTM), a novel method that dynamically aggregates tokens while preserving context to generate contextualized targets. DTM is compatible with various SSL frameworks; we showcase significantly improved MIM results, barely introducing extra training costs.
arXiv Detail & Related papers (2023-12-30T14:53:09Z)
ParGANDA: Making Synthetic Pedestrians A Reality For Object Detection [2.7648976108201815]
We propose to use a Generative Adversarial Network (GAN) to close the gap between the real and synthetic data. Our approach not only produces visually plausible samples but also does not require any labels of the real domain.
arXiv Detail & Related papers (2023-07-21T05:26:32Z)
Multitask AET with Orthogonal Tangent Regularity for Dark Object Detection [84.52197307286681]
We propose a novel multitask auto encoding transformation (MAET) model to enhance object detection in a dark environment. In a self-supervision manner, the MAET learns the intrinsic visual structure by encoding and decoding the realistic illumination-degrading transformation. We have achieved the state-of-the-art performance using synthetic and real-world datasets.
arXiv Detail & Related papers (2022-05-06T16:27:14Z)
Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)
Shape My Face: Registering 3D Face Scans by Surface-to-Surface Translation [75.59415852802958]
Shape-My-Face (SMF) is a powerful encoder-decoder architecture based on an improved point cloud encoder, a novel visual attention mechanism, graph convolutional decoders with skip connections, and a specialized mouth model. Our model provides topologically-sound meshes with minimal supervision, offers faster training time, has orders of magnitude fewer trainable parameters, is more robust to noise, and can generalize to previously unseen datasets.
arXiv Detail & Related papers (2020-12-16T20:02:36Z)