MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding
- URL: http://arxiv.org/abs/2508.02480v1
- Date: Mon, 04 Aug 2025 14:47:17 GMT
- Title: MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding
- Authors: Wenwen Zeng, Yonghuang Wu, Yifan Chen, Xuan Xie, Chengqian Zhao, Feiyu Yin, Guoqing Wu, Jinhua Yu,
- Abstract summary: We propose a novel divide-and-decode framework for multi-shot fMRI video reconstruction.<n>Our core innovations are: (1) A shot boundary predictor module explicitly decomposing mixed fMRI signals into shot-specific segments.<n> (2) Generative captioning using LLMs, which decodes robust textual descriptions from each segment, overcoming temporal blur by leveraging high-level semantics.
- Score: 7.066210443745838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reconstructing dynamic videos from fMRI is important for understanding visual cognition and enabling vivid brain-computer interfaces. However, current methods are critically limited to single-shot clips, failing to address the multi-shot nature of real-world experiences. Multi-shot reconstruction faces fundamental challenges: fMRI signal mixing across shots, the temporal resolution mismatch between fMRI and video obscuring rapid scene changes, and the lack of dedicated multi-shot fMRI-video datasets. To overcome these limitations, we propose a novel divide-and-decode framework for multi-shot fMRI video reconstruction. Our core innovations are: (1) A shot boundary predictor module explicitly decomposing mixed fMRI signals into shot-specific segments. (2) Generative keyframe captioning using LLMs, which decodes robust textual descriptions from each segment, overcoming temporal blur by leveraging high-level semantics. (3) Novel large-scale data synthesis (20k samples) from existing datasets. Experimental results demonstrate our framework outperforms state-of-the-art methods in multi-shot reconstruction fidelity. Ablation studies confirm the critical role of fMRI decomposition and semantic captioning, with decomposition significantly improving decoded caption CLIP similarity by 71.8%. This work establishes a new paradigm for multi-shot fMRI reconstruction, enabling accurate recovery of complex visual narratives through explicit decomposition and semantic prompting.
Related papers
- NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction [29.030311713701295]
We propose NeuroClips, an innovative framework to decode high-fidelity and smooth video from fMRI.<n>NeuroClips utilizes a semanticsor to reconstruct videos, guiding semantic accuracy and consistency, and employs a perception reconstructor to capture low-level perceptual details.<n>NeuroClips achieves smooth high-fidelity video reconstruction of up to 6s at 8FPS, gaining significant improvements over state-of-the-art models in various metrics.
arXiv Detail & Related papers (2024-10-25T10:28:26Z) - MindFormer: Semantic Alignment of Multi-Subject fMRI for Brain Decoding [50.55024115943266]
We introduce a novel semantic alignment method of multi-subject fMRI signals using so-called MindFormer.
This model is specifically designed to generate fMRI-conditioned feature vectors that can be used for conditioning Stable Diffusion model for fMRI- to-image generation or large language model (LLM) for fMRI-to-text generation.
Our experimental results demonstrate that MindFormer generates semantically consistent images and text across different subjects.
arXiv Detail & Related papers (2024-05-28T00:36:25Z) - Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity [13.04953215936574]
We propose a two-stage model named Mind-Animator to reconstruct human dynamic vision from brain activity.<n>During the fMRI-to-feature stage, we decouple semantic, structure, and motion features from fMRI.<n>In the feature-to-video stage, these features are integrated into videos using an inflated Stable Diffusion.
arXiv Detail & Related papers (2024-05-06T08:56:41Z) - NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation [55.51412454263856]
This paper proposes to directly modulate the generation process of diffusion models using fMRI signals.
By training with about 67,000 fMRI-image pairs from various individuals, our model enjoys superior fMRI-to-image decoding capacity.
arXiv Detail & Related papers (2024-03-27T02:42:52Z) - Fill the K-Space and Refine the Image: Prompting for Dynamic and
Multi-Contrast MRI Reconstruction [31.404228406642194]
The key to dynamic or multi-contrast magnetic resonance imaging (MRI) reconstruction lies in exploring inter-frame or inter-contrast information.
We propose a two-stage MRI reconstruction pipeline to address these limitations.
Our proposed method significantly outperforms previous state-of-the-art accelerated MRI reconstruction methods.
arXiv Detail & Related papers (2023-09-25T02:51:00Z) - Natural scene reconstruction from fMRI signals using generative latent
diffusion [1.90365714903665]
We present a two-stage scene reconstruction framework called Brain-Diffuser''
In the first stage, we reconstruct images that capture low-level properties and overall layout using a VDVAE (Very Deep Vari Autoencoder) model.
In the second stage, we use the image-to-image framework of a latent diffusion model conditioned on predicted multimodal (text and visual) features.
arXiv Detail & Related papers (2023-03-09T15:24:26Z) - BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP
for Generic Natural Visual Stimulus Decoding [51.911473457195555]
BrainCLIP is a task-agnostic fMRI-based brain decoding model.
It bridges the modality gap between brain activity, image, and text.
BrainCLIP can reconstruct visual stimuli with high semantic fidelity.
arXiv Detail & Related papers (2023-02-25T03:28:54Z) - Holistic Multi-Slice Framework for Dynamic Simultaneous Multi-Slice MRI
Reconstruction [8.02450593595801]
We propose a novel DL-based framework for dynamic SMS reconstruction.
Our main contributions are 1) a combination of data transformation steps and network design that effectively leverages the unique characteristics of undersampled dynamic SMS data, and 2) an MR physics-guided transfer learning strategy that addresses the data scarcity issue.
arXiv Detail & Related papers (2023-01-03T21:09:51Z) - Model-Guided Multi-Contrast Deep Unfolding Network for MRI
Super-resolution Reconstruction [68.80715727288514]
We show how to unfold an iterative MGDUN algorithm into a novel model-guided deep unfolding network by taking the MRI observation matrix.
In this paper, we propose a novel Model-Guided interpretable Deep Unfolding Network (MGDUN) for medical image SR reconstruction.
arXiv Detail & Related papers (2022-09-15T03:58:30Z) - Transformer-empowered Multi-scale Contextual Matching and Aggregation
for Multi-contrast MRI Super-resolution [55.52779466954026]
Multi-contrast super-resolution (SR) reconstruction is promising to yield SR images with higher quality.
Existing methods lack effective mechanisms to match and fuse these features for better reconstruction.
We propose a novel network to address these problems by developing a set of innovative Transformer-empowered multi-scale contextual matching and aggregation techniques.
arXiv Detail & Related papers (2022-03-26T01:42:59Z) - Multi-modal Aggregation Network for Fast MR Imaging [85.25000133194762]
We propose a novel Multi-modal Aggregation Network, named MANet, which is capable of discovering complementary representations from a fully sampled auxiliary modality.
In our MANet, the representations from the fully sampled auxiliary and undersampled target modalities are learned independently through a specific network.
Our MANet follows a hybrid domain learning framework, which allows it to simultaneously recover the frequency signal in the $k$-space domain.
arXiv Detail & Related papers (2021-10-15T13:16:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.