BIMM: Brain Inspired Masked Modeling for Video Representation Learning
- URL: http://arxiv.org/abs/2405.12757v1
- Date: Tue, 21 May 2024 13:09:04 GMT
- Title: BIMM: Brain Inspired Masked Modeling for Video Representation Learning
- Authors: Zhifan Wan, Jie Zhang, Changzhen Li, Shiguang Shan
- Abstract summary: We propose the Brain Inspired Masked Modeling (BIMM) framework, aiming to learn comprehensive representations from videos.
Specifically, our approach consists of ventral and dorsal branches, which learn image and video representations, respectively.
To achieve the goals of different visual cortices in the brain, we segment the encoder of each branch into three intermediate blocks and reconstruct progressive prediction targets with lightweight decoders.
- Score: 47.56270575865621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The visual pathway of the human brain includes two sub-pathways, i.e., the ventral pathway and the dorsal pathway, which focus on object identification and dynamic information modeling, respectively. Both pathways comprise multi-layer structures, with each layer responsible for processing different aspects of visual information. Inspired by the visual information processing mechanism of the human brain, we propose the Brain Inspired Masked Modeling (BIMM) framework, aiming to learn comprehensive representations from videos. Specifically, our approach consists of ventral and dorsal branches, which learn image and video representations, respectively. Both branches employ the Vision Transformer (ViT) as their backbone and are trained using a masked modeling method. To achieve the goals of different visual cortices in the brain, we segment the encoder of each branch into three intermediate blocks and reconstruct progressive prediction targets with lightweight decoders. Furthermore, drawing inspiration from the information-sharing mechanism in the visual pathways, we propose a partial parameter sharing strategy between the branches during training. Extensive experiments demonstrate that BIMM achieves superior performance compared to state-of-the-art methods.
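The three structural ideas in the abstract (masked inputs, an encoder segmented into three blocks with a decoder per block, and partial parameter sharing between two branches) can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The function names, the 196-token input, the 0.75 mask ratio, the 12-layer depth, and the choice to share attention sublayers while keeping MLP sublayers private are all assumptions made for illustration; the abstract does not specify them.

```python
import random


def random_mask(num_tokens, mask_ratio, seed=0):
    """Randomly split token indices into (visible, masked) lists."""
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def split_into_blocks(depth, num_blocks=3):
    """Partition layer indices 0..depth-1 into contiguous blocks."""
    base, extra = divmod(depth, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def build_branches(depth, shared=("attn",), private=("mlp",)):
    """Build per-layer parameter tables for two branches.

    Sublayers named in `shared` are a single object referenced by both
    branches; sublayers named in `private` are separate objects.
    Which sublayers to share is a hypothetical choice, not the paper's.
    """
    ventral, dorsal = [], []
    for layer in range(depth):
        common = {n: {"layer": layer, "weights": n} for n in shared}
        ventral.append({**common, **{n: {"layer": layer, "weights": n} for n in private}})
        dorsal.append({**common, **{n: {"layer": layer, "weights": n} for n in private}})
    return ventral, dorsal


# Assumed sizes: 196 tokens, 75% masked, 12 encoder layers.
visible, masked = random_mask(num_tokens=196, mask_ratio=0.75)
blocks = split_into_blocks(depth=12)
ventral, dorsal = build_branches(depth=12)

# A lightweight decoder would attach after the last layer of each block.
decoder_taps = [block[-1] for block in blocks]
print(len(visible), len(masked), decoder_taps)  # 49 147 [3, 7, 11]
```

In this sketch, updating `ventral[i]["attn"]` also changes `dorsal[i]["attn"]` because both names refer to the same object, which is the essence of sharing only part of the parameters between branches.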
Related papers
- Decoding Visual Experience and Mapping Semantics through Whole-Brain Analysis Using fMRI Foundation Models [10.615012396285337]
We develop algorithms to enhance our understanding of visual processes by incorporating whole-brain activation maps.
We first compare our method with state-of-the-art approaches to decoding visual processing and show a 43% improvement in predictive semantic accuracy.
(2024-11-11T16:51:17Z)
- Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance [3.74142789780782]
We show how modern LDMs incorporate multi-modal guidance for structurally and semantically plausible image generations.
Brain-Streams maps fMRI signals from brain regions to appropriate embeddings.
We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset.
(2024-09-18T16:19:57Z)
- A Dual-Stream Neural Network Explains the Functional Segregation of Dorsal and Ventral Visual Pathways in Human Brains [8.24969449883056]
We develop a dual-stream vision model inspired by the human eyes and brain.
At the input level, the model samples two complementary visual patterns.
At the backend, the model processes the separate input patterns through two branches of convolutional neural networks.
(2023-10-20T22:47:40Z)
- DREAM: Visual Decoding from Reversing Human Visual System [43.6339793925953]
We present DREAM, an fMRI-to-image method for reconstructing viewed images from brain activities.
We craft reverse pathways that emulate the hierarchical and parallel nature of how humans perceive the visual world.
(2023-10-03T17:59:58Z)
- Biologically-Motivated Learning Model for Instructed Visual Processing [3.105144691395886]
Current models of biologically plausible learning often use a cortical-like combination of bottom-up (BU) and top-down (TD) processing.
In the visual cortex, the TD pathway plays a second major role of visual attention, by guiding the visual process to locations and tasks of interest.
We introduce a model that uses a cortical-like combination of BU and TD processing that naturally integrates the two major functions of the TD stream.
(2023-06-04T17:38:06Z)
- Controllable Mind Visual Diffusion Model [58.83896307930354]
Brain signal visualization has emerged as an active research area, serving as a critical interface between the human visual system and computer vision models.
We propose a novel approach, referred to as Controllable Mind Visual Diffusion Model (CMVDM).
CMVDM extracts semantic and silhouette information from fMRI data using attribute alignment and assistant networks.
We then leverage a control model to fully exploit the extracted information for image synthesis, resulting in generated images that closely resemble the visual stimuli in terms of semantics and silhouette.
(2023-05-17T11:36:40Z)
- Joint fMRI Decoding and Encoding with Latent Embedding Alignment [77.66508125297754]
We introduce a unified framework that addresses both fMRI decoding and encoding.
Our model concurrently recovers visual stimuli from fMRI signals and predicts brain activity from images within a unified framework.
(2023-03-26T14:14:58Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
(2022-09-15T07:26:43Z)
- Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition.
We propose incorporating peripheral position encoding into the multi-head self-attention layers so that the network learns to partition the visual field into diverse peripheral regions given training data.
We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
(2022-06-14T12:47:47Z)
- Where to Look and How to Describe: Fashion Image Retrieval with an Attentional Heterogeneous Bilinear Network [50.19558726384559]
We propose a biologically inspired framework for image-based fashion product retrieval.
Our proposed framework achieves satisfactory performance on three image-based fashion product retrieval benchmarks.
(2020-10-26T06:01:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.