RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
- URL: http://arxiv.org/abs/2506.18819v1
- Date: Mon, 23 Jun 2025 16:28:03 GMT
- Title: RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
- Authors: Arjun Mukerji, Michael L. Jackson, Jason Jones, Neil Sanghavi
- Abstract summary: Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance. We introduce RWESummary, a proposed addition to the MedHELM framework to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies and was developed using Atropos Health proprietary data. Additionally, we use RWESummary to compare the performance of different LLMs in our internal RWE summarization tool. At the time of publication, with 13 distinct RWE studies, we found the Gemini 2.5 models performed best overall (both Flash and Pro). We suggest RWESummary as a novel and useful foundation model benchmark for real-world evidence study summarization.
Related papers
- 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models [58.98176123850354]
The recent release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. The implementation details of the released models have not been fully open-sourced by DeepSeek, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. Many replication studies have emerged aiming to reproduce the strong performance achieved by DeepSeek-R1, reaching comparable performance through similar training procedures and fully open-source data resources.
arXiv Detail & Related papers (2025-05-01T14:28:35Z)
- Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark [54.93461228053298]
We introduce our benchmark, Scenario-Wise Rec, which comprises 6 public datasets and 12 benchmark models, along with a training and evaluation pipeline. We aim for this benchmark to offer researchers valuable insights from prior work, enabling the development of novel models.
arXiv Detail & Related papers (2024-12-23T08:15:34Z)
- Lessons Learned on Information Retrieval in Electronic Health Records: A Comparison of Embedding Models and Pooling Strategies [8.822087602255504]
Applying large language models to the clinical domain is challenging due to the context-heavy nature of processing medical records.
This paper explores how different embedding models and pooling methods affect information retrieval for the clinical domain.
arXiv Detail & Related papers (2024-09-23T16:16:08Z)
- R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models [51.468732121824125]
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems.
Existing evaluation tools provide only a few baselines and evaluate them across various domains without probing the depth of domain knowledge.
In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG systems.
arXiv Detail & Related papers (2024-06-17T15:59:49Z)
- ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents [21.17856299966841]
This study introduces ResearchArena, a benchmark designed to evaluate large language models (LLMs) in conducting academic surveys. To support these opportunities, we construct an environment of 12M full-text academic papers and 7.9K survey papers.
arXiv Detail & Related papers (2024-06-13T03:26:30Z)
- RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing [0.2302001830524133]
This survey paper addresses the absence of a comprehensive overview of Retrieval-Augmented Language Models (RALMs).
The paper discusses the essential components of RALMs, including Retrievers, Language Models, and Augmentations.
RALMs demonstrate utility in a spectrum of tasks, from translation and dialogue systems to knowledge-intensive applications.
arXiv Detail & Related papers (2024-04-30T13:14:51Z)
- Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs).
Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z)
- Inverse Reinforcement Learning for Text Summarization [52.765898203824975]
We introduce inverse reinforcement learning (IRL) as an effective paradigm for training abstractive summarization models.
Experimental results across datasets in different domains demonstrate the superiority of our proposed IRL model for summarization over MLE and RL baselines.
arXiv Detail & Related papers (2022-12-19T23:45:05Z)
- Image-text Retrieval: A Survey on Recent Research and Development [58.060687870247996]
Cross-modal image-text retrieval (ITR) has experienced increased interest in the research community due to its excellent research value and broad real-world application.
This paper presents a comprehensive and up-to-date survey on the ITR approaches from four perspectives.
arXiv Detail & Related papers (2022-03-28T13:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.