HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion
- URL: http://arxiv.org/abs/2506.06035v2
- Date: Sat, 05 Jul 2025 04:11:40 GMT
- Title: HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion
- Authors: Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou
- Abstract summary: Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. HAVIR reconstructs both structural features and semantic information of visual stimuli even in complex scenarios.
- Score: 3.9136086794667597
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. Even though progress has been made in decoding images from fMRI using generative models, a challenge remains in accurately recovering highly complex visual stimuli. This difficulty stems from their elemental density and diversity, sophisticated spatial structures, and multifaceted semantic information. To address these challenges, we propose HAVIR that contains two adapters: (1) The AutoKL Adapter transforms fMRI voxels into a latent diffusion prior, capturing topological structures; (2) The CLIP Adapter converts the voxels to CLIP text and image embeddings, containing semantic information. These complementary representations are fused by Versatile Diffusion to generate the final reconstructed image. To extract the most essential semantic information from complex scenarios, the CLIP Adapter is trained with text captions describing the visual stimuli and their corresponding semantic images synthesized from these captions. The experimental results demonstrate that HAVIR effectively reconstructs both structural features and semantic information of visual stimuli even in complex scenarios, outperforming existing models.
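The two-adapter pipeline described in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the dimensions are shrunk, the trained adapters are replaced by random linear maps, and the Versatile Diffusion fusion step is stubbed out as a conditioning bundle.

```python
import random

random.seed(0)

FMRI_DIM = 8    # number of fMRI voxels (toy size; real data has thousands)
LATENT_DIM = 4  # flattened AutoKL latent (toy size)
CLIP_DIM = 3    # CLIP embedding size (toy size; CLIP uses 768)

def make_linear(in_dim, out_dim):
    """A stand-in for a trained adapter: a random linear map."""
    return [[random.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def apply(W, x):
    """Matrix-vector product over plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

autokl_W = make_linear(FMRI_DIM, LATENT_DIM)  # structural branch (AutoKL Adapter)
clip_img_W = make_linear(FMRI_DIM, CLIP_DIM)  # semantic branch, image (CLIP Adapter)
clip_txt_W = make_linear(FMRI_DIM, CLIP_DIM)  # semantic branch, text (CLIP Adapter)

def reconstruct(voxels):
    """Produce the conditioning bundle that Versatile Diffusion would fuse.
    The diffusion model itself is out of scope for this sketch."""
    return {
        "latent": apply(autokl_W, voxels),        # topological structure
        "clip_image": apply(clip_img_W, voxels),  # semantic image embedding
        "clip_text": apply(clip_txt_W, voxels),   # semantic text embedding
    }

voxels = [random.gauss(0, 1) for _ in range(FMRI_DIM)]
cond = reconstruct(voxels)
print(len(cond["latent"]), len(cond["clip_image"]), len(cond["clip_text"]))
```

The key design point the sketch reflects is that the two branches are complementary: the latent prior carries spatial structure while the CLIP embeddings carry semantics, and only the downstream diffusion model combines them.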
Related papers
- MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution [76.30559905769859]
MegaSR mines customized block-wise semantics and expressive guidance for diffusion-based ISR. We experimentally identify HED edge maps, depth maps, and segmentation maps as the most expressive guidance. Extensive experiments demonstrate the superiority of MegaSR in terms of semantic richness and structural consistency.
arXiv Detail & Related papers (2025-03-11T07:00:20Z) - Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance [3.74142789780782]
We show how modern LDMs incorporate multi-modal guidance for structurally and semantically plausible image generation.
Brain-Streams maps fMRI signals from brain regions to appropriate embeddings.
We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset.
arXiv Detail & Related papers (2024-09-18T16:19:57Z) - MindFormer: Semantic Alignment of Multi-Subject fMRI for Brain Decoding [50.55024115943266]
We introduce a novel semantic alignment method for multi-subject fMRI signals, called MindFormer.
This model is specifically designed to generate fMRI-conditioned feature vectors that can condition a Stable Diffusion model for fMRI-to-image generation or a large language model (LLM) for fMRI-to-text generation.
Our experimental results demonstrate that MindFormer generates semantically consistent images and text across different subjects.
arXiv Detail & Related papers (2024-05-28T00:36:25Z) - UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders.
We first develop an adaptive feature mask generator to account for the unique significance of nodes.
We then design a ranking-based structure reconstruction objective joint with feature reconstruction to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z) - MindDiffuser: Controlled Image Reconstruction from Human Brain Activity
with Semantic and Structural Diffusion [7.597218661195779]
We propose a two-stage image reconstruction model called MindDiffuser.
In Stage 1, the VQ-VAE latent representations and the CLIP text embeddings decoded from fMRI are put into Stable Diffusion.
In Stage 2, we utilize the CLIP visual feature decoded from fMRI as supervisory information, and continually adjust the two feature vectors decoded in Stage 1 through backpropagation to align the structural information.
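MindDiffuser's Stage-2 alignment loop can be sketched roughly as follows. The feature dimensions, learning rate, and squared-error objective are illustrative assumptions, not the paper's exact setup: the idea is only that a decoded feature vector is iteratively nudged by gradient descent toward the CLIP visual feature that serves as supervision.

```python
# Toy sketch of Stage-2 alignment: adjust a decoded feature vector by
# gradient descent so it matches a CLIP visual feature decoded from fMRI.
# Dimensions, learning rate, and loss are illustrative assumptions.

def align(decoded, target, lr=0.1, steps=200):
    x = list(decoded)
    for _ in range(steps):
        # gradient of the squared error 0.5 * sum((x - t)^2) w.r.t. x is (x - t)
        grad = [xi - ti for xi, ti in zip(x, target)]
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

clip_target = [0.2, -0.5, 1.0]    # CLIP visual feature decoded from fMRI (toy)
stage1_feature = [1.0, 1.0, 1.0]  # feature vector produced by Stage 1 (toy)

aligned = align(stage1_feature, clip_target)
error = sum((a - t) ** 2 for a, t in zip(aligned, clip_target))
print(error < 1e-6)  # the feature has converged onto the supervisory target
```

In the real model the "feature vectors" live inside a Stable Diffusion pipeline and the gradient flows through the network, but the control structure, a supervised inner optimization loop at inference time, is the same.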
arXiv Detail & Related papers (2023-08-08T13:28:34Z) - Joint fMRI Decoding and Encoding with Latent Embedding Alignment [77.66508125297754]
We introduce a unified framework that addresses both fMRI decoding and encoding.
Our model concurrently recovers visual stimuli from fMRI signals and predicts brain activity from images within a unified framework.
arXiv Detail & Related papers (2023-03-26T14:14:58Z) - BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP
for Generic Natural Visual Stimulus Decoding [51.911473457195555]
BrainCLIP is a task-agnostic fMRI-based brain decoding model.
It bridges the modality gap between brain activity, image, and text.
BrainCLIP can reconstruct visual stimuli with high semantic fidelity.
arXiv Detail & Related papers (2023-02-25T03:28:54Z) - SIM-Trans: Structure Information Modeling Transformer for Fine-grained
Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans), which incorporates object structure information into the transformer to enhance discriminative representation learning.
The proposed two modules are lightweight, can be plugged into any transformer network, and are easily trained end-to-end.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z) - Facial Image Reconstruction from Functional Magnetic Resonance Imaging
via GAN Inversion with Improved Attribute Consistency [5.705640492618758]
We propose a new framework to reconstruct facial images from fMRI data.
The proposed framework accomplishes two goals: (1) reconstructing clear facial images from fMRI data and (2) maintaining the consistency of semantic characteristics.
arXiv Detail & Related papers (2022-07-03T11:18:35Z) - Two-stage Visual Cues Enhancement Network for Referring Image
Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image, referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net)
Through the two-stage enhancement, our proposed TV-Net enjoys better performances in learning fine-grained matching behaviors between the natural language expression and image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Reconstructing Perceptive Images from Brain Activity by Shape-Semantic
GAN [16.169414324390218]
Reconstructing perceived images from fMRI recordings is an active research area in neuroscience.
Visual encoding in the brain is highly complex and not fully understood.
Inspired by the theory that visual features are hierarchically represented in cortex, we propose to break the complex visual signals into multi-level components.
arXiv Detail & Related papers (2021-01-28T16:04:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.