Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization
- URL: http://arxiv.org/abs/2601.16397v1
- Date: Fri, 23 Jan 2026 02:01:43 GMT
- Title: Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization
- Authors: Qianqi Yan, Huy Nguyen, Sumana Srivatsa, Hari Bandi, Xin Eric Wang, Krishnaram Kenthapadi
- Abstract summary: Trustworthy clinical summarization requires fluent generation and transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment.
- Score: 32.47484883374212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
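To make the mechanism concrete, below is a minimal, hypothetical sketch of attention-guided span attribution in the spirit of the abstract, not the authors' released implementation. It assumes a Hugging Face seq2seq summarizer (the model name is a placeholder), takes caller-provided source spans (dialogue turns, report sentences, or image captions), and scores each span by the decoder cross-attention mass it receives at every generation step; the mean over layers and heads is one plausible aggregation choice, not necessarily the one used in the paper.

```python
# Illustrative sketch of generation-time, attention-guided attribution.
# Assumptions (not from the paper): a Hugging Face seq2seq model that exposes
# cross-attentions, pre-segmented source spans, and mean pooling over
# decoder layers and heads as the span-scoring rule.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-base"  # placeholder summarizer for this sketch

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


def attribute_summary(source_spans, max_new_tokens=128):
    """Generate a summary and, for each generated token, cite the source span
    that receives the most decoder cross-attention mass."""
    # In caption-as-span mode an image is simply represented in `source_spans`
    # by its generated caption, so the same text-level pooling applies.
    source_text = " ".join(source_spans)
    enc = tokenizer(source_text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]

    # Record the character range of each span inside the concatenated source.
    span_bounds, cursor = [], 0
    for span in source_spans:
        start = source_text.index(span, cursor)
        span_bounds.append((start, start + len(span)))
        cursor = start + len(span)

    # Map every encoder token to the span containing it (-1 for special tokens).
    token_to_span = [
        -1 if te == ts else
        next((i for i, (s, e) in enumerate(span_bounds) if ts >= s and te <= e), -1)
        for ts, te in offsets.tolist()
    ]

    out = model.generate(
        **enc,
        max_new_tokens=max_new_tokens,
        output_attentions=True,
        return_dict_in_generate=True,
    )

    citations = []
    # out.cross_attentions: one tuple per generated step; each element is a
    # per-layer tensor of shape [batch, heads, 1, src_len].
    for step_attn in out.cross_attentions:
        # Average over layers and heads -> attention over source tokens.
        attn = torch.stack([a[0, :, 0, :] for a in step_attn]).mean(dim=(0, 1))
        # Pool token-level attention into span-level scores.
        scores = torch.zeros(len(source_spans))
        for tok_idx, span_idx in enumerate(token_to_span):
            if span_idx >= 0:
                scores[span_idx] += attn[tok_idx].item()
        citations.append(int(scores.argmax()))  # index of the cited span

    summary = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    return summary, citations
```

Under this reading, the caption-as-span mode described in the abstract amounts to placing an image's generated caption into `source_spans`, so a text-only decoder can still emit an image citation; the raw image mode would instead pool attention over the image's patch tokens in a vision-language decoder.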
Related papers
- BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation [3.7276397365086233]
BiCLIP is a framework engineered to bolster robustness in medical segmentation. It features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations. It exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
arXiv Detail & Related papers (2026-02-25T18:11:47Z)
- Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification [2.5995006632251516]
We present a unified vision-language framework tailored for ENT endoscopy image analysis. It simultaneously tackles three clinically relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval. We achieve 95% accuracy and F1-score in classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval respectively, and MRR scores of 0.97 and 0.96.
arXiv Detail & Related papers (2025-08-31T09:03:39Z)
- Redemption Score: A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation [3.4998703934432682]
Redemption Score (RS) is a novel framework that ranks image captions by triangulating three complementary signals. On the Flickr8k benchmark, RS achieves a Kendall-$\tau$ of 58.42, outperforming most prior methods.
arXiv Detail & Related papers (2025-05-22T03:35:12Z)
- Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity [14.223539927549782]
We propose a novel HybridMED framework to align global-level visual representations with the impression and token-level visual representations with the findings. Our framework incorporates a generation decoder that employs two proxy tasks: (1) generating the impression from images via a captioning branch, and (2) generating the findings via a summarization branch. Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements.
arXiv Detail & Related papers (2024-10-01T07:05:36Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Iterative Prompt Learning for Unsupervised Backlit Image Enhancement [86.90993077000789]
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT.
We show that the open-world CLIP prior aids in distinguishing between backlit and well-lit images.
Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved.
arXiv Detail & Related papers (2023-03-30T17:37:14Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.