Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning
- URL: http://arxiv.org/abs/2510.10068v1
- Date: Sat, 11 Oct 2025 07:05:34 GMT
- Title: Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning
- Authors: Pîrvu Mihai-Cristian, Leordeanu Marius
- Abstract summary: We introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE), a novel model that unifies the classical work on neural graphs with the modern approach of masked autoencoders. We show that we can apply knowledge distillation on top of the ensembles with little loss in performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The computer vision domain has greatly benefited from an abundance of data across many modalities to improve on various visual tasks. Recently, there has been a lot of focus on self-supervised pre-training methods through Masked Autoencoders (MAE) \cite{he2022masked,bachmann2022multimae}, usually used as a first step before optimizing for a downstream task, such as classification or regression. This is very useful as it doesn't require any manually labeled data. In this work, we introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE): a novel model that unifies the classical work on neural graphs \cite{leordeanu2021semi} with the modern approach of masked autoencoders under a common theoretical framework. Through random masking of entire modalities, not just patches, the model samples from the distribution of hyper-edges on each forward pass. Additionally, the model adapts the standard MAE algorithm by combining pre-training and fine-tuning into a single training loop. Moreover, our approach enables the creation of inference-time ensembles which, through aggregation, boost the final prediction performance and consistency. Lastly, we show that we can apply knowledge distillation on top of the ensembles with little loss in performance, even with models that have fewer than 1M parameters. While our work mostly focuses on outdoor UAV scenes that contain multiple world interpretations and modalities, the same steps can be followed in other similar domains, such as autonomous driving or indoor robotics. In order to streamline the process of integrating external pre-trained experts for computer vision multi-modal multi-task learning (MTL) scenarios, we developed a data-pipeline software. Using this tool, we have created and released a fully-automated extension of the Dronescapes dataset. All the technical details, code and reproduction steps are publicly released.
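As a rough illustration of the two mechanisms highlighted in the abstract, the sketch below samples a hyper-edge by masking entire modalities at random and builds an inference-time ensemble by aggregating predictions over several such samples. This is not the authors' implementation: the module structure, modality names, and mean aggregation are illustrative assumptions.

```python
# Minimal sketch (not the released PHG-MAE code) of modality-level random
# masking and inference-time ensembling. Shapes and names are illustrative.
import random
import torch
import torch.nn as nn

class TinyModalityAutoencoder(nn.Module):
    """Per-modality encoders, a shared fused representation, and per-modality decoders."""
    def __init__(self, modalities, channels, dim=64):
        super().__init__()
        self.encoders = nn.ModuleDict({m: nn.Conv2d(channels[m], dim, 3, padding=1) for m in modalities})
        self.decoders = nn.ModuleDict({m: nn.Conv2d(dim, channels[m], 3, padding=1) for m in modalities})

    def forward(self, inputs, visible):
        # Encode only the visible (unmasked) modalities and fuse by averaging.
        fused = torch.stack([self.encoders[m](inputs[m]) for m in visible]).mean(0)
        # Decode every modality, including the masked ones (reconstruction targets).
        return {m: dec(fused) for m, dec in self.decoders.items()}

def sample_hyper_edge(modalities, keep_prob=0.5):
    """Randomly mask whole modalities; always keep at least one as input."""
    visible = [m for m in modalities if random.random() < keep_prob]
    return visible or [random.choice(list(modalities))]

@torch.no_grad()
def ensemble_predict(model, inputs, target, n_samples=8):
    """Aggregate predictions for `target` over several sampled hyper-edges."""
    sources = [m for m in inputs if m != target]
    preds = [model(inputs, sample_hyper_edge(sources))[target] for _ in range(n_samples)]
    return torch.stack(preds).mean(0)  # mean aggregation; median is another option

if __name__ == "__main__":
    channels = {"rgb": 3, "depth": 1, "semantic": 8}
    model = TinyModalityAutoencoder(channels.keys(), channels)
    batch = {m: torch.rand(2, c, 32, 32) for m, c in channels.items()}
    # One training step: a fresh hyper-edge is sampled on every forward pass.
    visible = sample_hyper_edge(channels.keys())
    out = model(batch, visible)
    loss = sum(nn.functional.mse_loss(out[m], batch[m]) for m in channels)
    loss.backward()
    # Inference-time ensemble for one target modality.
    depth_pred = ensemble_predict(model, batch, target="depth")
    print(visible, loss.item(), depth_pred.shape)
```

In this toy setup the same loop doubles as pre-training and fine-tuning, since every forward pass reconstructs all modalities from a randomly chosen subset; the publicly released code should be consulted for the actual architecture, losses, and aggregation scheme.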
Related papers
- Understanding and Enhancing Mask-Based Pretraining towards Universal Representations [13.262679155411599]
Mask-based pretraining has become a cornerstone of modern large-scale models across language, vision, and biology. We show that the behavior of mask-based pretraining can be directly characterized by the test risk of high-dimensional minimum-norm ("ridge-less") linear regression. We propose Randomly Random Mask AutoEncoder (R$^2$MAE), which enforces capturing multi-scale features from data.
arXiv Detail & Related papers (2025-09-25T22:08:25Z)
- MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks [11.359741665798195]
This paper explores a more flexible multi-modal, multi-task pre-training strategy for Earth Observation (EO) data. Specifically, we adopt a Multi-modal Multi-task Masked Autoencoder (MultiMAE) that we pre-train by reconstructing diverse input modalities. Our approach exhibits significant flexibility, handling diverse input configurations without requiring modality-specific pre-trained models.
arXiv Detail & Related papers (2025-05-20T22:24:36Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning [9.540487697801531]
MMEarth is a diverse multi-modal pretraining dataset at global scale.
We propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images.
arXiv Detail & Related papers (2024-05-04T23:16:48Z)
- Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts [104.9871176044644]
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training.
We propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE).
MoCE trains each expert only with semantically relevant images by using cluster-conditional gates.
arXiv Detail & Related papers (2024-02-08T03:46:32Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry training needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require a large number of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)