Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization
- URL: http://arxiv.org/abs/2601.16397v1
- Date: Fri, 23 Jan 2026 02:01:43 GMT
- Title: Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization
- Authors: Qianqi Yan, Huy Nguyen, Sumana Srivatsa, Hari Bandi, Xin Eric Wang, Krishnaram Kenthapadi
- Abstract summary: Trustworthy clinical summarization requires fluent generation and transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment.
- Score: 32.47484883374212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
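To make the mechanism concrete, below is a minimal, hypothetical sketch of attention-guided span attribution in the spirit of the abstract, not the authors' released implementation. It assumes a Hugging Face seq2seq summarizer (the model name is a placeholder), takes caller-provided source spans (dialogue turns, report sentences, or image captions), and scores each span by the decoder cross-attention mass it receives at every generation step; the mean over layers and heads is one plausible aggregation choice, not necessarily the one used in the paper.

```python
# Illustrative sketch of generation-time, attention-guided attribution.
# Assumptions (not from the paper): a Hugging Face seq2seq model that exposes
# cross-attentions, pre-segmented source spans, and mean pooling over
# decoder layers and heads as the span-scoring rule.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-base"  # placeholder summarizer for this sketch

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


def attribute_summary(source_spans, max_new_tokens=128):
    """Generate a summary and, for each generated token, cite the source span
    that receives the most decoder cross-attention mass."""
    # In caption-as-span mode an image is simply represented in `source_spans`
    # by its generated caption, so the same text-level pooling applies.
    source_text = " ".join(source_spans)
    enc = tokenizer(source_text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]

    # Record the character range of each span inside the concatenated source.
    span_bounds, cursor = [], 0
    for span in source_spans:
        start = source_text.index(span, cursor)
        span_bounds.append((start, start + len(span)))
        cursor = start + len(span)

    # Map every encoder token to the span containing it (-1 for special tokens).
    token_to_span = [
        -1 if te == ts else
        next((i for i, (s, e) in enumerate(span_bounds) if ts >= s and te <= e), -1)
        for ts, te in offsets.tolist()
    ]

    out = model.generate(
        **enc,
        max_new_tokens=max_new_tokens,
        output_attentions=True,
        return_dict_in_generate=True,
    )

    citations = []
    # out.cross_attentions: one tuple per generated step; each element is a
    # per-layer tensor of shape [batch, heads, 1, src_len].
    for step_attn in out.cross_attentions:
        # Average over layers and heads -> attention over source tokens.
        attn = torch.stack([a[0, :, 0, :] for a in step_attn]).mean(dim=(0, 1))
        # Pool token-level attention into span-level scores.
        scores = torch.zeros(len(source_spans))
        for tok_idx, span_idx in enumerate(token_to_span):
            if span_idx >= 0:
                scores[span_idx] += attn[tok_idx].item()
        citations.append(int(scores.argmax()))  # index of the cited span

    summary = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    return summary, citations
```

Under this reading, the caption-as-span mode described in the abstract amounts to placing an image's generated caption into `source_spans`, so a text-only decoder can still emit an image citation; the raw image mode would instead pool attention over the image's patch tokens in a vision-language decoder.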
Related papers
- BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation [3.7276397365086233]
BiCLIP is a framework engineered to bolster robustness in medical segmentation. It features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations. It exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
arXiv Detail & Related papers (2026-02-25T18:11:47Z)
- Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification [2.5995006632251516]
We present a unified vision-language framework tailored for ENT endoscopy image analysis. It simultaneously tackles three clinically relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval. We achieve 95% accuracy and F1-score in classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval respectively, and MRR scores of 0.97 and 0.96.
arXiv Detail & Related papers (2025-08-31T09:03:39Z)
- Redemption Score: A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation [3.4998703934432682]
Redemption Score (RS) is a novel framework that ranks image captions by triangulating three complementary signals. On the Flickr8k benchmark, RS achieves a Kendall-$\tau$ of 58.42, outperforming most prior methods.
arXiv Detail & Related papers (2025-05-22T03:35:12Z)
- Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity [14.223539927549782]
We propose a novel HybridMED framework to align global-level visual representations with the impression and token-level visual representations with the findings. Our framework incorporates a generation decoder that employs two proxy tasks: (1) generating the impression from images via a captioning branch, and (2) generating the findings via a summarization branch. Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements.
arXiv Detail & Related papers (2024-10-01T07:05:36Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Iterative Prompt Learning for Unsupervised Backlit Image Enhancement [86.90993077000789]
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT.
We show that the open-world CLIP prior aids in distinguishing between backlit and well-lit images.
Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved.
arXiv Detail & Related papers (2023-03-30T17:37:14Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.