BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling
- URL: http://arxiv.org/abs/2311.17218v1
- Date: Tue, 28 Nov 2023 20:42:30 GMT
- Title: BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling
- Authors: Yixuan Luo, Mengye Ren, Sai Qian Zhang
- Abstract summary: Masked image modeling (MIM) aims to extract valuable insights from image patches to enhance the feature extraction capabilities of the underlying deep neural network (DNN).
- Score: 18.861945284506028
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Like masked language modeling (MLM) in natural language processing, masked
image modeling (MIM) aims to extract valuable insights from image patches to
enhance the feature extraction capabilities of the underlying deep neural
network (DNN). Contrasted with other training paradigms like supervised
learning and unsupervised contrastive learning, masked image modeling (MIM)
pretraining typically demands significant computational resources in order to
manage large training data batches (e.g., 4096). The significant memory and
computation requirements pose a considerable challenge to its broad adoption.
To mitigate this, we introduce a novel learning framework,
termed Block-Wise Masked Image Modeling (BIM). This framework involves
decomposing the MIM tasks into several sub-tasks with independent computation
patterns, resulting in block-wise back-propagation operations instead of the
traditional end-to-end approach. Our proposed BIM maintains superior
performance compared to conventional MIM while greatly reducing peak memory
consumption. Moreover, BIM naturally enables the concurrent training of
numerous DNN backbones of varying depths. This leads to the creation of
multiple trained DNN backbones, each tailored to different hardware platforms
with distinct computing capabilities. This approach significantly reduces
computational costs in comparison with training each DNN backbone individually.
Our framework offers a promising solution for resource-constrained training of
MIM.
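To make the block-wise idea above concrete, here is a minimal PyTorch sketch, assuming that each encoder block is paired with its own lightweight reconstruction head, optimizes a local masked-patch loss, and passes detached activations to the next block so that back-propagation never spans the whole backbone; the names (BlockWithLocalHead, local_head, block_wise_step) and all dimensions are illustrative assumptions rather than the paper's exact design.

```python
# Minimal, hedged sketch of block-wise masked image modeling: each block owns a
# local reconstruction head and its own optimizer; activations are detached
# between blocks, so peak memory is bounded by one block rather than the full
# end-to-end graph. All names and dimensions here are illustrative.
import torch
import torch.nn as nn

class BlockWithLocalHead(nn.Module):
    """One transformer block plus a hypothetical local reconstruction head."""
    def __init__(self, dim: int, patch_dim: int):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.local_head = nn.Linear(dim, patch_dim)  # predicts masked patch pixels

    def forward(self, tokens):
        feats = self.block(tokens)
        return feats, self.local_head(feats)

def block_wise_step(blocks, optimizers, tokens, target_patches, mask):
    """One training step in which each block back-propagates only its local loss."""
    x, losses = tokens, []
    for block, opt in zip(blocks, optimizers):
        feats, recon = block(x)
        loss = ((recon - target_patches) ** 2)[mask].mean()  # loss on masked patches
        opt.zero_grad()
        loss.backward()      # gradients stay inside this block
        opt.step()
        x = feats.detach()   # stop gradients before the next block
        losses.append(loss.item())
    return losses

# Usage sketch with 4 blocks: every prefix of `blocks` is itself a trained backbone.
dim, patch_dim, n_tokens = 192, 768, 196
blocks = [BlockWithLocalHead(dim, patch_dim) for _ in range(4)]
optimizers = [torch.optim.AdamW(b.parameters(), lr=1e-4) for b in blocks]
tokens = torch.randn(8, n_tokens, dim)              # embedded (masked) patch tokens
target_patches = torch.randn(8, n_tokens, patch_dim)
mask = torch.rand(8, n_tokens) < 0.75               # 75% masking ratio
print(block_wise_step(blocks, optimizers, tokens, target_patches, mask))
```

Because every prefix of the block list is trained with its own objective, one run yields several backbones of different depths, which mirrors the concurrent training of backbones of varying depths described in the abstract.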
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike the discretization line of methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR achieves substantially better performance than other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
- SG-MIM: Structured Knowledge Guided Efficient Pre-training for Dense Prediction [17.44991827937427]
Masked Image Modeling techniques have redefined the landscape of computer vision.
Despite their success, the full potential of MIM-based methods in dense prediction tasks, particularly in depth estimation, remains untapped.
We propose SG-MIM, a novel Structured knowledge Guided Masked Image Modeling framework designed to enhance dense prediction tasks by utilizing structured knowledge alongside images.
arXiv Detail & Related papers (2024-09-04T08:24:53Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that a careful mix of image-caption, interleaved image-text, and text-only data is crucial for large-scale multimodal pre-training.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- Deep learning enhanced mixed integer optimization: Learning to reduce model dimensionality [0.0]
This work introduces a framework to address the computational complexity inherent in Mixed-Integer Programming.
By employing deep learning, we construct problem-specific models that identify and exploit common structures across MIP instances.
We present an algorithm for generating synthetic data that enhances the robustness and generalizability of our models.
arXiv Detail & Related papers (2024-01-17T19:15:13Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
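As a toy illustration of the general idea of searching over masking ratios, the sketch below uses a simple epsilon-greedy bandit to pick among candidate ratios based on an assumed reward signal; it is a stand-in for intuition only and does not reproduce the paper's actual RL formulation, state/action design, or reward.

```python
# Toy stand-in only: epsilon-greedy selection of a masking ratio. The real
# method uses reinforcement learning with its own state, action, and reward
# design; the reward below is a synthetic placeholder for demonstration.
import random

class MaskRatioBandit:
    def __init__(self, candidate_ratios, epsilon=0.1):
        self.ratios = candidate_ratios
        self.epsilon = epsilon
        self.counts = [0] * len(candidate_ratios)
        self.values = [0.0] * len(candidate_ratios)  # running mean reward

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.ratios))  # explore
        return max(range(len(self.ratios)), key=lambda i: self.values[i])  # exploit

    def update(self, idx, reward):
        self.counts[idx] += 1
        self.values[idx] += (reward - self.values[idx]) / self.counts[idx]

bandit = MaskRatioBandit([0.5, 0.6, 0.75, 0.9])
for _ in range(200):
    idx = bandit.select()
    ratio = bandit.ratios[idx]
    # Placeholder reward; in practice it might reflect pretext or downstream gains.
    reward = -abs(ratio - 0.75) + random.gauss(0.0, 0.05)
    bandit.update(idx, reward)
best = max(range(len(bandit.ratios)), key=lambda i: bandit.values[i])
print("preferred masking ratio:", bandit.ratios[best])
```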
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
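A minimal sketch of the shared-backbone, multi-head pattern is shown below; the layer sizes and the simple averaging of head outputs are assumptions for illustration, not MEMTL's exact architecture or inference rule.

```python
# Generic illustration of a shared backbone feeding multiple prediction heads,
# with head outputs averaged as a simple ensemble. Sizes and the combination
# rule are assumptions, not the MEMTL design.
import torch
import torch.nn as nn

class MultiHeadEnsemble(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int, n_heads: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, out_dim) for _ in range(n_heads)]
        )

    def forward(self, x):
        feats = self.backbone(x)                                   # shared features
        preds = torch.stack([head(feats) for head in self.heads])  # (n_heads, B, out_dim)
        return preds.mean(dim=0)                                   # ensemble average

model = MultiHeadEnsemble(in_dim=16, hidden=64, out_dim=4, n_heads=3)
print(model(torch.randn(2, 16)).shape)  # torch.Size([2, 4])
```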
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
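The sketch below shows one way to set up masked autoencoding over state-action trajectories: random trajectory tokens are replaced by a learned mask token and a small transformer reconstructs them. The dimensions, masking scheme, and loss are illustrative assumptions rather than the exact MaskDP recipe.

```python
# Compact, hedged sketch of masked autoencoding on trajectories: mask random
# tokens (states/actions) and reconstruct them with a transformer encoder.
# Dimensions and the masking scheme are assumptions for illustration.
import torch
import torch.nn as nn

class MaskedTrajectoryAE(nn.Module):
    def __init__(self, token_dim: int = 32, model_dim: int = 64):
        super().__init__()
        self.embed = nn.Linear(token_dim, model_dim)
        self.mask_token = nn.Parameter(torch.zeros(model_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.decode = nn.Linear(model_dim, token_dim)

    def forward(self, traj, mask):
        # traj: (B, T, token_dim) interleaved state/action tokens; mask: (B, T) bool
        x = self.embed(traj)
        x = torch.where(mask.unsqueeze(-1), self.mask_token, x)  # hide masked tokens
        return self.decode(self.encoder(x))

model = MaskedTrajectoryAE()
traj = torch.randn(4, 20, 32)
mask = torch.rand(4, 20) < 0.5                     # mask half of the tokens
recon = model(traj, mask)
loss = ((recon - traj) ** 2)[mask].mean()          # reconstruct only masked tokens
loss.backward()
print(float(loss))
```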
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Meta Fine-Tuning Neural Language Models for Multi-Domain Text Mining [37.2106265998237]
We propose an effective learning procedure named Meta Fine-Tuning (MFT).
MFT serves as a meta-learner to solve a group of similar NLP tasks for neural language models.
We implement MFT upon BERT to solve several multi-domain text mining tasks.
arXiv Detail & Related papers (2020-03-29T11:27:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.