Historical Report Guided Bi-modal Concurrent Learning for Pathology Report Generation
- URL: http://arxiv.org/abs/2506.18658v1
- Date: Mon, 23 Jun 2025 14:00:21 GMT
- Title: Historical Report Guided Bi-modal Concurrent Learning for Pathology Report Generation
- Authors: Ling Zhang, Boxiang Yun, Qingli Li, Yan Wang
- Abstract summary: The Historical Report Guided Bi-modal Concurrent Learning Framework for Pathology Report Generation (BiGen) emulates pathologists' diagnostic reasoning. BiGen retrieves WSI-relevant knowledge from a pre-built medical knowledge bank by matching high-attention patches. The framework achieves state-of-the-art performance, with a 7.4% relative improvement in NLP metrics and a 19.1% gain in classification metrics for Her-2 prediction versus existing methods.
- Score: 14.8602760818616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated pathology report generation from Whole Slide Images (WSIs) faces two key challenges: (1) lack of semantic content in visual features and (2) inherent information redundancy in WSIs. To address these issues, we propose a novel Historical Report Guided Bi-modal Concurrent Learning Framework for Pathology Report Generation (BiGen) emulating pathologists' diagnostic reasoning, consisting of: (1) a knowledge retrieval mechanism that provides rich semantic content by retrieving WSI-relevant knowledge from a pre-built medical knowledge bank via matching against high-attention patches, and (2) a bi-modal concurrent learning strategy, instantiated via a learnable visual token and a learnable textual token, that dynamically extracts key visual features and retrieved knowledge, where weight-shared layers enable cross-modal alignment between visual features and knowledge features. Our multi-modal decoder then integrates both modalities for comprehensive diagnostic report generation. Experiments on the PathText (BRCA) dataset demonstrate our framework's superiority, achieving state-of-the-art performance with a 7.4% relative improvement in NLP metrics and a 19.1% enhancement in classification metrics for Her-2 prediction versus existing methods. Ablation studies validate the necessity of our proposed modules, highlighting our method's ability to provide WSI-relevant rich semantic content and suppress information redundancy in WSIs. Code is publicly available at https://github.com/DeepMed-Lab-ECNU/BiGen.
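The abstract names two concrete mechanisms: attention-guided retrieval from a knowledge bank, and concurrent learning through learnable per-modality tokens with weight-shared layers. A minimal PyTorch sketch of how these pieces could fit together follows; the module names, dimensions, and the cosine-similarity retrieval rule are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiModalConcurrentBlock(nn.Module):
    """One bi-modal block: a learnable token per modality attends to its
    feature set, and a weight-shared projection encourages cross-modal
    alignment (an illustrative reading of the abstract, not the released code)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.visual_token = nn.Parameter(torch.randn(1, 1, dim))
        self.textual_token = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.shared = nn.Linear(dim, dim)  # weight-shared across modalities

    def forward(self, patch_feats, knowledge_feats):
        b = patch_feats.size(0)
        v_tok = self.visual_token.expand(b, -1, -1)
        t_tok = self.textual_token.expand(b, -1, -1)
        v, _ = self.attn(v_tok, patch_feats, patch_feats)          # key visual features
        t, _ = self.attn(t_tok, knowledge_feats, knowledge_feats)  # key knowledge
        return self.shared(v), self.shared(t)  # same weights -> aligned spaces

def retrieve_knowledge(patch_feats, attn_scores, bank, k_patches=16, k_items=8):
    """Match the top-attention patches against a pre-built knowledge bank
    (cosine similarity is an assumption; the paper may use another metric)."""
    top = attn_scores.topk(k_patches, dim=1).indices               # (B, k_patches)
    queries = torch.gather(
        patch_feats, 1, top.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1)))
    sims = F.normalize(queries, dim=-1) @ F.normalize(bank, dim=-1).T
    idx = sims.mean(dim=1).topk(k_items, dim=-1).indices           # (B, k_items)
    return bank[idx]                                               # (B, k_items, D)

# Toy usage: 100 WSI patches per slide, a 500-entry knowledge bank.
feats = torch.randn(2, 100, 512)
scores = torch.rand(2, 100)
bank = torch.randn(500, 512)
knowledge = retrieve_knowledge(feats, scores, bank)
v, t = BiModalConcurrentBlock()(feats, knowledge)
print(v.shape, t.shape)  # torch.Size([2, 1, 512]) twice
```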
Related papers
- Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning [27.49826980862286]
We propose an in-context learning framework called PathGenIC that integrates context derived from the training set with a multimodal in-context learning mechanism. Our method dynamically retrieves semantically similar whole slide image (WSI)-report pairs and incorporates adaptive feedback to enhance contextual relevance and generation quality.
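A minimal sketch of the retrieval step this summary describes: nearest neighbours in embedding space over training WSI representations select report examples for the in-context prompt. Cosine similarity and the prompt template are assumptions; PathGenIC's adaptive-feedback mechanism is not reproduced here.

```python
import torch
import torch.nn.functional as F

def build_incontext_prompt(query_emb, train_embs, train_reports, k=3):
    """Pick the k most similar training WSI embeddings and splice their
    reports into a prompt (hypothetical template; the paper's own
    formatting and feedback loop are not reproduced here)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(train_embs, dim=-1).T
    top = sims.topk(k).indices.tolist()
    examples = "\n\n".join(
        f"Example report {i + 1}:\n{train_reports[j]}" for i, j in enumerate(top))
    return f"{examples}\n\nWrite the pathology report for the new slide:"

# Toy usage with random embeddings standing in for WSI features.
query = torch.randn(512)
train = torch.randn(10, 512)
reports = [f"Report #{i}: benign tissue, no atypia." for i in range(10)]
print(build_incontext_prompt(query, train, reports))
```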
arXiv Detail & Related papers (2025-06-21T08:56:45Z)
- Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains. Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA). We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z)
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmentation over KGs (Amar) framework. The method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. Our method achieves state-of-the-art performance on two common datasets.
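A small sketch of the step this summary highlights, converting retrieved KG text into prompt embeddings: verbalized entities, relations, and subgraphs are encoded and projected into soft prompt tokens for the LLM. The encoder dimensions and the linear projection are stand-in assumptions, not Amar's actual design.

```python
import torch
import torch.nn as nn

class RetrievalToPrompt(nn.Module):
    """Turn retrieved KG text (entities, relations, verbalized subgraphs)
    into soft prompt embeddings prepended to the LLM input; the projection
    here is an illustrative stand-in."""

    def __init__(self, text_dim=384, llm_dim=1024, tokens_per_item=4):
        super().__init__()
        self.tokens_per_item = tokens_per_item
        self.project = nn.Linear(text_dim, llm_dim * tokens_per_item)

    def forward(self, retrieved_text_embs):           # (num_items, text_dim)
        n = retrieved_text_embs.size(0)
        out = self.project(retrieved_text_embs)       # (n, llm_dim * t)
        return out.view(n * self.tokens_per_item, -1) # (n * t, llm_dim) soft prompt

# Toy usage: three retrieved pieces (an entity, a relation, a subgraph),
# embedded by any off-the-shelf sentence encoder.
embs = torch.randn(3, 384)
prompt = RetrievalToPrompt()(embs)
print(prompt.shape)  # torch.Size([12, 1024]) -> prepend to the LLM inputs
```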
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
- A Multimodal Approach Combining Structural and Cross-domain Textual Guidance for Weakly Supervised OCT Segmentation [12.948027961485536]
We propose a novel Weakly Supervised Semantic Segmentation (WSSS) approach that integrates structural guidance with text-driven strategies to generate high-quality pseudo labels.
Our method achieves state-of-the-art performance, highlighting its potential to improve diagnostic accuracy and efficiency in medical imaging.
arXiv Detail & Related papers (2024-11-19T16:20:27Z)
- Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity [14.223539927549782]
We propose a novel HybridMED framework to align global-level visual representations with impressions and token-level visual representations with findings. Our framework incorporates a generation decoder that employs two proxy tasks: (1) generating the impression from images via a captioning branch, and (2) generating it from findings via a summarization branch. Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements.
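A hedged sketch of the two proxy tasks: both branches predict the impression, one from image features and one from findings text, with a distillation term letting the summarization branch teach the captioning branch. The KL formulation is one plausible reading of "distills knowledge", not HybridMED's verified loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchProxy(nn.Module):
    """Two proxy heads: a captioning branch (impression from image features)
    and a summarization branch (impression from findings text), with the
    summarization branch acting as teacher (assumed formulation)."""

    def __init__(self, dim=256, vocab=5000):
        super().__init__()
        self.caption_head = nn.Linear(dim, vocab)
        self.summary_head = nn.Linear(dim, vocab)

    def forward(self, image_state, findings_state):
        cap_logits = self.caption_head(image_state)     # (B, T, V)
        sum_logits = self.summary_head(findings_state)  # (B, T, V)
        distill = F.kl_div(
            F.log_softmax(cap_logits, dim=-1),
            F.softmax(sum_logits.detach(), dim=-1),     # teacher is not updated here
            reduction="batchmean")
        return cap_logits, sum_logits, distill

# Toy usage with random decoder states of length 12.
model = DualBranchProxy()
cap, summ, loss = model(torch.randn(2, 12, 256), torch.randn(2, 12, 256))
print(cap.shape, loss.item())
```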
arXiv Detail & Related papers (2024-10-01T07:05:36Z)
- Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which endows semantic information into the visual prompt to distill a semantic-enhanced prompt for visual representation enrichment. AENet comprises two key steps: (1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts, and (2) yielding the semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- Dynamic Traceback Learning for Medical Report Generation [12.746275623663289]
This study proposes a novel multi-modal dynamic traceback learning framework (DTrace) for medical report generation.
We introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input.
The proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
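A speculative sketch of the two ingredients named above: a traceback-style consistency check that re-encodes generated content against the input, and a dynamic strategy that varies the proportion of image and text tokens per step. Both formulations are assumptions for illustration; DTrace's actual objectives may differ.

```python
import torch
import torch.nn.functional as F

def traceback_loss(input_emb, regenerated_emb):
    """Cycle-style supervision: pull the embedding of re-encoded generated
    content toward the input embedding (assumed form of 'traceback')."""
    return 1 - F.cosine_similarity(input_emb, regenerated_emb, dim=-1).mean()

def dynamic_mix(image_feats, text_feats, min_keep=0.3):
    """Keep a random proportion of each modality's tokens per training step,
    so the model adapts to varying image/text input mixes (assumed mechanism)."""
    keep_img = float(torch.empty(()).uniform_(min_keep, 1.0))
    keep_txt = float(torch.empty(()).uniform_(min_keep, 1.0))
    n_img = max(1, int(image_feats.size(1) * keep_img))
    n_txt = max(1, int(text_feats.size(1) * keep_txt))
    return image_feats[:, :n_img], text_feats[:, :n_txt]

# Toy usage with random features standing in for encoded image/text tokens.
img, txt = dynamic_mix(torch.randn(2, 100, 256), torch.randn(2, 40, 256))
print(img.shape, txt.shape)
print(traceback_loss(torch.randn(2, 256), torch.randn(2, 256)).item())
```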
arXiv Detail & Related papers (2024-01-24T07:13:06Z)
- Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution [49.762034744605955]
We propose a multi-modal information bottleneck approach to improve interpretability of vision-language models.
We demonstrate how M2IB can be applied to attribution analysis of vision-language pretrained models.
arXiv Detail & Related papers (2023-12-28T18:02:22Z)
- Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing models for target sentiment classification (TSC) built on pre-trained language models (PTLMs) fall into two groups: (1) fine-tuning-based models that adopt the PTLM as the context encoder and (2) prompting-based models that recast classification as a text/word generation task. We present a new perspective on leveraging PTLMs for TSC: simultaneously exploiting the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z)
- CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval [66.93563107820687]
We introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for Text-based Person Retrieval (TPR). To explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed from text-to-image and image-to-text bidirectional prompts and coupling projections. CSKT outperforms state-of-the-art approaches across three benchmark datasets while its trainable parameters account for merely 7.4% of the entire model.
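A minimal sketch of what a bidirectional-prompt module with coupling projections could look like: learnable prompts for each CLIP branch, each also projected into the other branch's token space. Dimensions and wiring are assumptions, though the small trainable footprint is consistent with the parameter figure quoted above.

```python
import torch
import torch.nn as nn

class BidirectionalPrompts(nn.Module):
    """Learnable prompts for each CLIP branch plus coupling projections that
    carry text prompts into the image branch and vice versa (an illustrative
    reading of the BPT description; dimensions and wiring are assumptions)."""

    def __init__(self, n_prompts=4, txt_dim=512, img_dim=768):
        super().__init__()
        self.txt_prompts = nn.Parameter(torch.randn(n_prompts, txt_dim) * 0.02)
        self.img_prompts = nn.Parameter(torch.randn(n_prompts, img_dim) * 0.02)
        self.t2i = nn.Linear(txt_dim, img_dim)   # text-to-image coupling
        self.i2t = nn.Linear(img_dim, txt_dim)   # image-to-text coupling

    def forward(self):
        img_side = torch.cat([self.img_prompts, self.t2i(self.txt_prompts)], dim=0)
        txt_side = torch.cat([self.txt_prompts, self.i2t(self.img_prompts)], dim=0)
        return img_side, txt_side  # prepend to the respective CLIP token streams

# Only these small modules would train; the CLIP backbone stays frozen.
bpt = BidirectionalPrompts()
img_p, txt_p = bpt()
print(img_p.shape, txt_p.shape)  # torch.Size([8, 768]) torch.Size([8, 512])
```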
arXiv Detail & Related papers (2023-09-18T05:38:49Z)
- Medical Knowledge-enriched Textual Entailment Framework [5.493804101940195]
We present a novel Medical Knowledge-Enriched Textual Entailment framework.
We evaluate our framework on the benchmark MEDIQA-RQE dataset and show that the knowledge-enriched dual-encoding mechanism helps achieve an absolute improvement of 8.27% over SOTA language models.
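A generic sketch of a knowledge-enriched dual-encoding setup for entailment: one encoder for the sentence pair, one for retrieved medical knowledge, fused for the entailment decision. The GRU encoders and late fusion are placeholder choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class KnowledgeDualEncoder(nn.Module):
    """Dual encoding for textual entailment: one encoder for the sentence
    pair, one for retrieved medical knowledge, fused for a binary entailment
    decision (a generic sketch; the paper's encoders and fusion may differ)."""

    def __init__(self, vocab=10000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pair_enc = nn.GRU(dim, dim, batch_first=True)
        self.know_enc = nn.GRU(dim, dim, batch_first=True)
        self.classify = nn.Linear(2 * dim, 2)  # entails / does not entail

    def forward(self, pair_ids, knowledge_ids):
        _, h_pair = self.pair_enc(self.embed(pair_ids))
        _, h_know = self.know_enc(self.embed(knowledge_ids))
        fused = torch.cat([h_pair[-1], h_know[-1]], dim=-1)
        return self.classify(fused)

# Toy usage with random token ids for an RQE-style pair plus knowledge text.
model = KnowledgeDualEncoder()
logits = model(torch.randint(0, 10000, (2, 30)), torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 2])
```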
arXiv Detail & Related papers (2020-11-10T17:25:27Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose MRG-Net, a novel online multi-modal graph network that dynamically integrates visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.