PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution
- URL: http://arxiv.org/abs/2601.03418v1
- Date: Tue, 06 Jan 2026 21:12:03 GMT
- Title: PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution
- Authors: Bohao Chu, Sameh Frihat, Tabea M. G. Pakull, Hendrik Damm, Meijie Li, Ula Muhabbek, Georg Lodde, Norbert Fuhr,
- Abstract summary: PCoA is an expert-annotated benchmark for medical aspect-based summarization with phrase-level context attribution. We propose a fine-grained, decoupled evaluation framework that independently assesses the quality of generated summaries, citations, and contributory phrases.
- Score: 1.4248535198162013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Verifying system-generated summaries remains challenging, as effective verification requires precise attribution to the source context, which is especially crucial in high-stakes medical domains. To address this challenge, we introduce PCoA, an expert-annotated benchmark for medical aspect-based summarization with phrase-level context attribution. PCoA aligns each aspect-based summary with its supporting contextual sentences and contributory phrases within them. We further propose a fine-grained, decoupled evaluation framework that independently assesses the quality of generated summaries, citations, and contributory phrases. Through extensive experiments, we validate the quality and consistency of the PCoA dataset and benchmark several large language models on the proposed task. Experimental results demonstrate that PCoA provides a reliable benchmark for evaluating system-generated summaries with phrase-level context attribution. Furthermore, comparative experiments show that explicitly identifying relevant sentences and contributory phrases before summarization can improve overall quality. The data and code are available at https://github.com/chubohao/PCoA.
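The abstract describes each benchmark item as an aspect-based summary aligned with supporting source sentences and, within them, contributory phrases, all scored by a decoupled evaluation. Below is a minimal sketch of what such a record and scorer might look like, assuming hypothetical field names and plain set-overlap F1; this is not the official PCoA schema or metric suite (those live in the linked repository):

```python
# Minimal sketch of a phrase-level attribution record and a decoupled
# evaluation loop. Field names and metrics are illustrative assumptions,
# not the official PCoA schema or scorer (see github.com/chubohao/PCoA).
from dataclasses import dataclass, field

@dataclass
class AttributedSummary:
    aspect: str                                       # e.g. "adverse events"
    summary: str                                      # aspect-based summary text
    citations: set[int] = field(default_factory=set)  # indices of supporting source sentences
    phrases: dict[int, set[str]] = field(default_factory=dict)  # sentence index -> contributory phrases

def set_f1(pred: set, gold: set) -> float:
    """Plain set-overlap F1, reused here for citation and phrase quality."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if not tp:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def decoupled_eval(pred: AttributedSummary, gold: AttributedSummary) -> dict:
    """Score citations and phrases independently of summary wording,
    mirroring the decoupled evaluation idea in the abstract."""
    shared = pred.citations & gold.citations
    phrase_scores = [set_f1(pred.phrases.get(i, set()), gold.phrases.get(i, set()))
                     for i in shared]
    return {
        "citation_f1": set_f1(pred.citations, gold.citations),
        "phrase_f1": sum(phrase_scores) / len(phrase_scores) if phrase_scores else 0.0,
        # summary quality would be scored separately, e.g. with ROUGE or an LLM judge
    }
```

Decoupling in this sense means a weak summary cannot inflate or mask the attribution scores, and strong attribution cannot hide a weak summary.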
Related papers
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write approaches significantly outperform single-turn generation.
arXiv Detail & Related papers (2026-01-07T03:07:52Z)
- TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain [1.5732353205551508]
We introduce TracSum, a novel benchmark for traceable, aspect-based summarization. Generated summaries are paired with sentence-level citations, enabling users to trace back to the original context. We show that TracSum can serve as an effective benchmark for traceable, aspect-based summarization tasks.
arXiv Detail & Related papers (2025-08-19T12:57:45Z)
- Contextual Embedding-based Clustering to Identify Topics for Healthcare Service Improvement [3.9726806016869936]
This study explores unsupervised methods to extract meaningful topics from 439 survey responses collected from a healthcare system in Wisconsin, USA. A keyword-based filtering approach was applied to isolate complaint-related feedback using a domain-specific lexicon. To improve coherence and interpretability where data are scarce and consist of short texts, we propose kBERT.
arXiv Detail & Related papers (2025-04-18T20:38:24Z)
- Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality. We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z)
- PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization [5.5899921245557]
Hallucinated outputs from large language models pose risks in the medical domain. We introduce PlainQAFact, an automatic factual consistency evaluation metric trained on a fine-grained, human-annotated dataset, PlainFact.
arXiv Detail & Related papers (2025-03-11T20:59:53Z)
- QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization [62.809455597778616]
We propose QAPyramid, which decomposes each reference summary into finer-grained question-answer pairs according to the QA-SRL framework. We show that, compared to Pyramid, QAPyramid provides more systematic and fine-grained content selection evaluation while maintaining high inter-annotator agreement without needing expert annotations.
arXiv Detail & Related papers (2024-12-10T01:29:51Z)
- ALiiCE: Evaluating Positional Fine-grained Citation Generation [54.19617927314975]
We propose ALiiCE, the first automatic evaluation framework for fine-grained citation generation.
Our framework first parses the sentence claim into atomic claims via dependency analysis and then calculates citation quality at the atomic claim level (a toy sketch of atomic-claim scoring appears after this list).
We evaluate the positional fine-grained citation generation performance of several Large Language Models on two long-form QA datasets.
arXiv Detail & Related papers (2024-06-19T09:16:14Z)
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- NapSS: Paragraph-level Medical Text Simplification via Narrative Prompting and Sentence-matching Summarization [46.772517928718216]
We propose a summarize-then-simplify two-stage strategy, which we call NapSS.
NapSS identifies the relevant content to simplify while ensuring that the original narrative flow is preserved.
Our model performs significantly better than the seq2seq baseline on an English medical corpus.
arXiv Detail & Related papers (2023-02-11T02:20:25Z)
- Comparing Methods for Extractive Summarization of Call Centre Dialogue [77.34726150561087]
We experimentally compare several such methods by using them to produce summaries of calls, and evaluating these summaries objectively.
We found that TopicSum and Lead-N outperform the other summarisation methods, whilst BERTSum received comparatively lower scores in both subjective and objective evaluations (a minimal Lead-N sketch follows this list).
arXiv Detail & Related papers (2022-09-06T13:16:02Z)
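The toy sketch promised in the ALiiCE entry above: citation quality computed per atomic claim rather than per sentence. ALiiCE decomposes claims via dependency analysis; the naive clause split and token-overlap support test here are stand-in assumptions, not the paper's actual method:

```python
# Toy sketch of atomic-claim citation scoring in the spirit of ALiiCE.
# The clause split and overlap test below are crude stand-ins for
# dependency-based decomposition and a trained support/entailment model.
import re

def split_atomic_claims(sentence: str) -> list[str]:
    # Stand-in for dependency-based decomposition: split on clause boundaries.
    parts = re.split(r",|;| and | but ", sentence)
    return [p.strip() for p in parts if p.strip()]

def supports(source_sentence: str, claim: str) -> bool:
    # Stand-in for an entailment model: crude token-overlap test.
    claim_tokens = set(claim.lower().split())
    src_tokens = set(source_sentence.lower().split())
    return len(claim_tokens & src_tokens) / max(len(claim_tokens), 1) > 0.5

def atomic_citation_recall(sentence: str, cited: list[str]) -> float:
    """Fraction of atomic claims supported by at least one cited sentence."""
    claims = split_atomic_claims(sentence)
    if not claims:
        return 1.0
    hit = sum(any(supports(src, c) for src in cited) for c in claims)
    return hit / len(claims)
```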
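And the Lead-N sketch promised in the call-centre entry: Lead-N simply takes the first N sentences of the source, which is why it serves as a standard extractive reference point. The regex sentence splitter is a naive stand-in:

```python
# Lead-N extractive baseline: the summary is the first N sentences.
import re

def lead_n(text: str, n: int = 3) -> str:
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

# Example:
# lead_n("Agent greets caller. Caller reports a billing issue. "
#        "Agent verifies the account. Refund is issued.", 2)
# -> "Agent greets caller. Caller reports a billing issue."
```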