Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy
- URL: http://arxiv.org/abs/2602.23088v1
- Date: Thu, 26 Feb 2026 15:10:39 GMT
- Title: Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy
- Authors: Matthew Sutton, Katrin Amunts, Timo Dickscheid, Christian Schiffer
- Abstract summary: We propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy. These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural-language interfaces in domains where fine-grained paired annotations are scarce.
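The abstract describes two ingredients: pairing each histology patch with a literature-derived description purely through its shared area label, and coupling a frozen vision foundation model to a language model via an image-to-text (captioning) objective. The PyTorch sketch below illustrates that overall shape under stated assumptions; it is not the paper's implementation. The area labels, descriptions, `FrozenEncoder`, and `CaptionHead` are illustrative placeholders standing in for CytoNet, the LLM, and the literature-mined captions.

```python
# Minimal sketch of label-mediated image-text pairing (illustrative, not the authors' code).
import torch
import torch.nn as nn

# 1) Label-mediated pairing: images and captions are linked only via the shared
#    area label, never as curated image-text pairs. The entries below are
#    hand-written stand-ins for descriptions mined from the literature.
area_descriptions = {
    "hOc1": "thick layer IV with densely packed granular cells",
    "44":   "dysgranular cortex with prominent layer III pyramidal cells",
}

def make_pairs(patches, labels):
    """Attach the label's canonical description to every patch carrying that label."""
    return [(img, area_descriptions[lab]) for img, lab in zip(patches, labels)
            if lab in area_descriptions]  # patches from unseen areas are simply dropped

# 2) Image-to-text coupling: a frozen vision encoder feeds a trainable projection
#    whose output conditions a text decoder trained with a captioning loss.
class FrozenEncoder(nn.Module):          # stand-in for a pretrained CytoNet
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, dim))
        for p in self.parameters():
            p.requires_grad = False      # encoder stays fixed; only the bridge/decoder train
    def forward(self, x):
        return self.net(x)

class CaptionHead(nn.Module):            # toy decoder; a real system would use an LLM
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                   # vision-to-text bridge
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)
    def forward(self, feats, token_embeds):
        prefix = self.proj(feats).unsqueeze(1)            # image feature as a prefix token
        hidden, _ = self.decoder(torch.cat([prefix, token_embeds], dim=1))
        return self.lm_head(hidden)                       # next-token logits for a captioning loss

encoder, head = FrozenEncoder(), CaptionHead()
imgs = torch.randn(2, 1, 64, 64)                          # two synthetic patches
logits = head(encoder(imgs), torch.randn(2, 8, 256))      # 8 caption tokens per patch
print(logits.shape)                                       # torch.Size([2, 9, 1000])
```

The design point mirrored here is that no curated image-caption pairs are required: the label alone mediates the pairing, and patches whose label has no mined description are excluded, which is also the natural hook for the explicit rejection of unseen areas mentioned in the abstract.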
Related papers
- Rescind: Countering Image Misconduct in Biomedical Publications with Vision-Language and State-Space Modeling [8.024142807011378]
We present the first vision-language guided framework for both generating and detecting biomedical image forgeries. By combining diffusion-based synthesis with vision-language prompting, our method enables realistic and semantically controlled manipulations. Integscan achieves state-of-the-art performance in both detection and localization, establishing a strong foundation for automated scientific integrity analysis.
arXiv Detail & Related papers (2026-01-12T22:13:58Z)
- Plasticine: A Traceable Diffusion Model for Medical Image Translation [79.39689106440389]
We propose Plasticine, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.
arXiv Detail & Related papers (2025-12-20T18:01:57Z)
- From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature [86.7745150269054]
We introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels. We develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases.
arXiv Detail & Related papers (2025-12-02T09:37:51Z)
- BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models [40.106880795877466]
Images and captions can be viewed as complementary samples from the latent morphospace of a species. We generate synthetic captions with Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based captions.
arXiv Detail & Related papers (2025-10-23T00:34:21Z)
- BiomedParse: a biomedical foundation model for image parsing of everything everywhere all at once [58.41069132627823]
Holistic image analysis comprises subtasks such as segmentation, detection, and recognition of relevant objects.
Here, we propose BiomedParse, a biomedical foundation model for image parsing that can jointly conduct segmentation, detection, and recognition for 82 object types across 9 imaging modalities.
Through joint learning, we can improve accuracy for individual tasks and enable novel applications such as segmenting all relevant objects in a noisy image through a text prompt.
arXiv Detail & Related papers (2024-05-21T17:54:06Z)
- Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS), for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- Integrating curation into scientific publishing to train AI models [1.6982459897303823]
We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions.
The dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities.
We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task.
arXiv Detail & Related papers (2023-10-31T13:22:38Z)
- Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing [17.96645738679543]
We show that textual semantic modelling can substantially improve contrastive learning in self-supervised vision-language processing.
We propose a self-supervised joint vision-language approach with a focus on better text modelling.
arXiv Detail & Related papers (2022-04-21T00:04:35Z)
- Clinical Named Entity Recognition using Contextualized Token Representations [49.036805795072645]
This paper introduces the technique of contextualized word embedding to better capture the semantic meaning of each word based on its context.
We pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair).
Explicit experiments show that our models gain dramatic improvements compared to both static word embeddings and domain-generic language models.
arXiv Detail & Related papers (2021-06-23T18:12:58Z)