BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
- URL: http://arxiv.org/abs/2510.20095v2
- Date: Fri, 24 Oct 2025 01:51:09 GMT
- Title: BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
- Authors: Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
- Abstract summary: Images and captions can be viewed as complementary samples from the latent morphospace of a species. We generate synthetic captions with Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based captions.
- Score: 40.106880795877466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We address this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BioCAP (i.e., BioCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.
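As a rough illustration of the training objective behind CLIP-derived models like BioCAP, the abstract describes aligning images with captions in a shared embedding space. A minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss is shown below; this is not the authors' released code, and all function names and the temperature value are illustrative assumptions:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature       # (batch, batch) cosine similarities
    labels = np.arange(len(logits))          # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy usage: the loss is small for matched pairs, large for mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(emb, emb)         # perfectly matched pairs
shuffled = clip_contrastive_loss(emb, emb[::-1])  # mismatched pairs
```

The intuition matching the abstract: captions that describe diagnostic traits pull an image toward the correct region of the shared space, while the contrastive denominator pushes it away from other species' descriptions.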
Related papers
- Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy [1.7429354559347476]
We propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas.
arXiv Detail & Related papers (2026-02-26T15:10:39Z)
- Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris [11.07566750390282]
We extend Re-ID methodologies by incorporating precise dermatoglyphic textual descriptors. We show that these specialist semantics abstract and encode animal coat topology using human-interpretable language tags. We conclude that dermatoglyphic language-guided biometrics can overcome vision-only limitations.
arXiv Detail & Related papers (2025-12-16T19:47:02Z)
- Hyperbolic Multimodal Representation Learning for Biological Taxonomies [23.639218053531962]
Taxonomic classification in biodiversity research involves organizing biological specimens into structured hierarchies based on evidence. We investigate whether hyperbolic networks can provide a better embedding space for such hierarchical data. Our method embeds multimodal inputs into a shared hyperbolic space using a contrastive objective and a novel stacked entailment-based objective.
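For context on the hyperbolic-embedding idea in the summary above, methods in this family typically optimize distances in the Poincaré ball, where distances grow rapidly near the boundary and tree-like taxonomies embed with low distortion. A small illustrative sketch (not this paper's exact objective) of the standard Poincaré distance:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    """
    uu = np.dot(u, u)
    vv = np.dot(v, v)
    duv = np.dot(u - v, u - v)
    arg = 1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv) + eps)
    return np.arccosh(arg)

# Points near the boundary (leaf taxa) are far from everything; points near
# the origin (root taxa) stay comparatively close to all descendants.
origin = np.zeros(2)
near = np.array([0.1, 0.0])
boundary = np.array([0.95, 0.0])
```

This geometry is why a hierarchy with exponentially many leaves can be embedded in few dimensions, which is difficult in Euclidean space.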
arXiv Detail & Related papers (2025-08-22T18:52:50Z)
- Enhancing Biomedical Multi-modal Representation Learning with Multi-scale Pre-training and Perturbed Report Discrimination [13.654729300824227]
Vision-language models pre-trained on large collections of unlabeled biomedical images learn generalizable semantic representations. We propose a novel method, perturbed report discrimination, for pre-training biomedical vision-language models.
arXiv Detail & Related papers (2025-06-02T17:23:25Z)
- BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning [60.80381372245902]
We find emergent behaviors in biological vision models via large-scale contrastive vision-language training. We train BioCLIP 2 on TreeOfLife-200M to distinguish different species. We identify emergent properties in the learned embedding space of BioCLIP 2.
arXiv Detail & Related papers (2025-05-29T17:48:20Z)
- CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity [3.73232466691291]
We present CrypticBio, the largest publicly available dataset of visually confusing species. Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K unique cryptic groups spanning 67K species.
arXiv Detail & Related papers (2025-05-16T14:35:56Z)
- Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification [12.923336716880506]
We integrate image captioning and retrieval-augmented generation (RAG) with large language models (LLMs) to enhance biodiversity monitoring. Our findings highlight the potential for modern vision-language AI pipelines to support biodiversity conservation initiatives.
arXiv Detail & Related papers (2025-03-13T21:18:10Z)
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: informationally sufficient, minimally redundant, and readily comprehensible by humans. We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- Text-guided Foundation Model Adaptation for Pathological Image Classification [40.45252665455015]
We propose to connect image and text Embeddings (CITE) to enhance pathological image classification.
CITE injects text insights gained from language models pre-trained on a broad range of biomedical texts, adapting foundation models toward pathological image understanding.
arXiv Detail & Related papers (2023-07-27T14:44:56Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.