SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance
- URL: http://arxiv.org/abs/2602.21819v2
- Date: Fri, 27 Feb 2026 08:58:58 GMT
- Title: SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance
- Authors: Minghan Yang, Lan Yang, Ke Li, Honggang Zhang, Kaiyue Pang, Yizhe Song
- Abstract summary: We introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus. We show that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.
- Score: 52.34513874272676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.
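The abstract describes a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner. The paper's training objective is not given here, but a common choice for aligning two embedding spaces is a symmetric contrastive (InfoNCE) loss over paired batches. The following is a minimal NumPy sketch under that assumption; the function name, array shapes, and temperature value are our own illustrative choices, not SemVideo's implementation:

```python
import numpy as np

def clip_style_alignment_loss(fmri_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: each fMRI embedding is pulled toward its
    paired semantic embedding (the diagonal of the similarity matrix)."""
    # L2-normalize both embedding sets so the dot product is cosine similarity
    z = fmri_emb / np.linalg.norm(fmri_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = z @ t.T / temperature  # (N, N) scaled similarity matrix

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    n = len(z)
    diag = np.arange(n)
    # fMRI -> semantic direction and semantic -> fMRI direction, averaged
    loss_f2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2f = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_f2t + loss_t2f) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))                   # stand-in semantic embeddings
aligned = text + 0.01 * rng.normal(size=(8, 16))  # a well-trained decoder's output
random_ = rng.normal(size=(8, 16))                # an untrained decoder's output
assert clip_style_alignment_loss(aligned, text) < clip_style_alignment_loss(random_, text)
```

A well-aligned decoder drives the diagonal similarities up and the loss toward zero, which is the behavior the assertion above checks.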
Related papers
- STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution [60.06664986365803]
We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model. It aims to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motions.
arXiv Detail & Related papers (2025-11-24T05:37:23Z) - DynaMind: Reconstructing Dynamic Visual Scenes from EEG by Aligning Temporal Dynamics and Multimodal Semantics to Guided Diffusion [10.936858717759156]
We introduce DynaMind, a novel framework that reconstructs video by jointly modeling neural dynamics and semantic features. On the SEED-DV dataset, DynaMind sets a new state-of-the-art (SOTA), boosting reconstructed video accuracies by 12.5 and 10.3 percentage points. This marks a critical advancement, bridging the gap between neural dynamics and high-fidelity visual semantics.
arXiv Detail & Related papers (2025-09-01T06:52:08Z) - HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics [60.737929335600015]
We present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization.
arXiv Detail & Related papers (2025-08-13T14:50:19Z) - MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding [7.066210443745838]
We propose a novel divide-and-decode framework for multi-shot fMRI video reconstruction. Our core innovations are: (1) a shot boundary predictor module explicitly decomposing mixed fMRI signals into shot-specific segments; (2) generative captioning using LLMs, which decodes robust textual descriptions from each segment, overcoming temporal blur by leveraging high-level semantics.
arXiv Detail & Related papers (2025-08-04T14:47:17Z) - SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism which disentangles subject and motion representations. At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z) - DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding [82.91021399231184]
Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. We propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components - semantic, spatial, and motion - then decodes each component separately before fusing them to reconstruct the video.
arXiv Detail & Related papers (2025-04-01T05:28:37Z) - Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction [13.110669865114533]
NEURONS is a framework that decouples learning into four correlated sub-tasks. It simulates the visual cortex's functional specialization, allowing the model to capture diverse video content. NEURONS shows a strong functional correlation with the visual cortex, highlighting its potential for brain-computer interfaces and clinical applications.
arXiv Detail & Related papers (2025-03-14T08:12:28Z) - NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction [29.030311713701295]
We propose NeuroClips, an innovative framework to decode high-fidelity and smooth video from fMRI. NeuroClips utilizes a semantics reconstructor to reconstruct videos, guiding semantic accuracy and consistency, and employs a perception reconstructor to capture low-level perceptual details. NeuroClips achieves smooth, high-fidelity video reconstruction of up to 6 s at 8 FPS, gaining significant improvements over state-of-the-art models on various metrics.
arXiv Detail & Related papers (2024-10-25T10:28:26Z) - Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity [13.04953215936574]
We propose a two-stage model named Mind-Animator to reconstruct human dynamic vision from brain activity. During the fMRI-to-feature stage, we decouple semantic, structure, and motion features from fMRI. In the feature-to-video stage, these features are integrated into videos using an inflated Stable Diffusion.
arXiv Detail & Related papers (2024-05-06T08:56:41Z) - Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval [67.52910255064762]
We first design a simple dual-stream structure, including a temporal layer and a hash layer.
With the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval.
In this way, the model naturally preserves the disentangled semantics into binary codes.
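The dual-stream idea above - a temporal layer producing features and a hash layer mapping them into binary codes that preserve semantic similarity for retrieval - can be sketched as follows. The random projection standing in for a trained hash layer, and all feature shapes, are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical video-level features, e.g. output of a temporal layer
feats = rng.normal(size=(4, 32))
W = rng.normal(size=(32, 16))             # stand-in for a trained hash layer
codes = (feats @ W > 0).astype(np.uint8)  # 16-bit binary codes via sign thresholding

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

# Retrieval: rank the database by Hamming distance to the query's code
query = codes[0]
ranking = sorted(range(len(codes)), key=lambda i: hamming(query, codes[i]))
assert ranking[0] == 0  # the query's own code is at distance 0
```

Binary codes make retrieval cheap because Hamming distance reduces to bit operations; the training signal (self-supervised semantic similarity, in the paper's case) determines which videos end up with nearby codes.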
arXiv Detail & Related papers (2023-10-12T03:21:12Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
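The asymmetric attention described for MAT - one stream (motion) querying the other (appearance) rather than symmetric bidirectional attention - can be illustrated with a minimal single-head cross-attention in NumPy. The shapes and feature contents below are placeholders, not MATNet's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
# Hypothetical per-location features from a two-stream encoder
motion = rng.normal(size=(10, 8))      # flow stream: provides the queries
appearance = rng.normal(size=(10, 8))  # RGB stream: provides keys and values

# Asymmetric: motion attends over appearance, but not the reverse
attn = softmax(motion @ appearance.T / np.sqrt(8), axis=1)  # (10, 10) weights
transferred = attn @ appearance  # motion-attentive appearance features
assert transferred.shape == (10, 8)
assert np.allclose(attn.sum(axis=1), 1.0)
```

The asymmetry encodes a prior that moving regions should select which appearance features to emphasize, which is the intuition behind using motion cues to localize objects in zero-shot segmentation.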
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.