Structured-Noise Masked Modeling for Video, Audio and Beyond
- URL: http://arxiv.org/abs/2503.16311v1
- Date: Thu, 20 Mar 2025 16:34:14 GMT
- Title: Structured-Noise Masked Modeling for Video, Audio and Beyond
- Authors: Aritra Bhowmik, Fida Mohammad Thoker, Carlos Hinojosa, Bernard Ghanem, Cees G. M. Snoek,
- Abstract summary: We introduce structured noise-based masking, a simple yet effective approach that naturally aligns with the spatial, temporal, and spectral characteristics of video and audio data.<n>Our approach improves the performance of masked video and audio modeling frameworks without any computational overhead.
- Score: 74.6545360752605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked modeling has emerged as a powerful self-supervised learning framework, but existing methods largely rely on random masking, disregarding the structural properties of different modalities. In this work, we introduce structured noise-based masking, a simple yet effective approach that naturally aligns with the spatial, temporal, and spectral characteristics of video and audio data. By filtering white noise into distinct color noise distributions, we generate structured masks that preserve modality-specific patterns without requiring handcrafted heuristics or access to the data. Our approach improves the performance of masked video and audio modeling frameworks without any computational overhead. Extensive experiments demonstrate that structured noise masking achieves consistent improvement over random masking for standard and advanced masked modeling methods, highlighting the importance of modality-aware masking strategies for representation learning.
Related papers
- Evolved Hierarchical Masking for Self-Supervised Learning [49.77271430882176]
Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training.
This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning.
arXiv Detail & Related papers (2025-04-12T09:40:14Z) - EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning [46.66166658067071]
EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling) is a novel self-supervised learning approach for speech representation learning.
We introduce a novel selective and adaptive masking strategy for Masked Acoustic Modeling (MAM)
EH-MAM outperforms several state-of-the-art baselines across various low-resource speech recognition and SUPERB benchmarks by 5%-10%.
arXiv Detail & Related papers (2024-10-17T02:59:22Z) - ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z) - Asymmetric Mask Scheme for Self-Supervised Real Image Denoising [14.18283674891189]
We propose a single mask scheme for self-supervised denoising training, which eliminates the need for blind spot operation.
Our method, featuring the asymmetric mask scheme in training and inference, achieves state-of-the-art performance on existing real noisy image datasets.
arXiv Detail & Related papers (2024-07-09T03:01:28Z) - Denoising Diffusion Semantic Segmentation with Mask Prior Modeling [61.73352242029671]
We propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a denoising diffusion generative model.
We evaluate the proposed prior modeling with several off-the-shelf segmentors, and our experimental results on ADE20K and Cityscapes demonstrate that our approach could achieve competitively quantitative performance.
arXiv Detail & Related papers (2023-06-02T17:47:01Z) - Mask to reconstruct: Cooperative Semantics Completion for Video-text
Retrieval [19.61947785487129]
Mask for Semantics Completion (MASCOT) based on semantic-based masked modeling.
Our MASCOT performs state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z) - Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z) - Masked Frequency Modeling for Self-Supervised Visual Pre-Training [102.89756957704138]
We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models.
MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum.
For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
arXiv Detail & Related papers (2022-06-15T17:58:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.