Related papers: UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting

URL: http://arxiv.org/abs/2506.05589v1
Date: Thu, 05 Jun 2025 21:07:55 GMT
Title: UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting
Authors: Sara Shields-Menard, Zach Reimers, Joshua Gardner, David Perry, Anthony Rios
Abstract summary: We describe our system for answering clinical questions using electronic health records. Our approach uses large language models in two steps: first, to find sentences relevant to a clinician's question, and second, to generate a short, citation-supported response.
Score: 5.882312167168893
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We describe our system for the ArchEHR-QA Shared Task on answering clinical questions using electronic health records (EHRs). Our approach uses large language models in two steps: first, to find sentences in the EHR relevant to a clinician's question, and second, to generate a short, citation-supported response based on those sentences. We use few-shot prompting, self-consistency, and thresholding to improve the sentence classification step, which decides which sentences are essential. We compare several models and find that a smaller 8B model performs better than a larger 70B model for identifying relevant information. Our results show that accurate sentence selection is critical for generating high-quality responses and that self-consistency with thresholding helps make these decisions more reliable.
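
As a rough illustration of what self-consistency with thresholding could look like for the sentence classification step, the sketch below samples several votes per sentence and keeps a sentence only when the share of "essential" votes reaches a threshold. It is not the authors' implementation: the function names, the `classify_once` interface, the sample count, the threshold values, and the keyword-overlap `toy_classifier` (a deterministic stand-in for a sampled LLM call) are all assumptions made for this example.

```python
from typing import Callable, List, Sequence


def self_consistent_labels(
    question: str,
    sentences: Sequence[str],
    classify_once: Callable[[str, str, int], bool],
    n_samples: int = 5,
    threshold: float = 0.6,
) -> List[bool]:
    """Mark a sentence as essential when enough sampled votes agree."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    labels = []
    for sentence in sentences:
        # Each call stands in for one sampled model response (hypothetical interface).
        votes = [classify_once(question, sentence, i) for i in range(n_samples)]
        vote_share = sum(votes) / n_samples
        # Thresholding: keep the sentence only if the vote share is high enough.
        labels.append(vote_share >= threshold)
    return labels


def toy_classifier(question: str, sentence: str, sample_index: int) -> bool:
    """Deterministic stand-in for a sampled LLM call (assumption, not a real model)."""
    overlap = len(set(question.lower().split()) & set(sentence.lower().split()))
    if overlap >= 2:
        return True  # strong match: every sample votes "essential"
    if overlap == 1:
        return sample_index % 2 == 0  # weak match: only some samples agree
    return False


if __name__ == "__main__":
    question = "why was the patient given antibiotics"
    sentences = [
        "The patient received antibiotics for pneumonia.",
        "Discharge planned for Friday.",
        "The family visited today.",
    ]
    print(self_consistent_labels(question, sentences, toy_classifier, threshold=0.8))
    print(self_consistent_labels(question, sentences, toy_classifier, threshold=0.5))
```

Raising the threshold trades recall for precision: in this toy setup the weakly matching third sentence is kept at a threshold of 0.5 but dropped at 0.8.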

Related papers

Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering [3.3260862557368926]
We present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. A self-consistency voting scheme further improves evidence recall without sacrificing precision.
arXiv Detail & Related papers (2025-06-12T14:36:18Z)
A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization [15.837772594006038]
ArchEHR-QA is an expert-annotated dataset based on real-world patient cases from intensive care unit and emergency department settings. Cases comprise questions posed by patients to public health forums, clinician-interpreted counterparts, relevant clinical note excerpts with sentence-level relevance annotations, and clinician-authored answers. The answer-first prompting approach consistently performed best, with Llama 4 achieving the highest scores.
arXiv Detail & Related papers (2025-06-04T16:55:08Z)
Give me Some Hard Questions: Synthetic Data Generation for Clinical QA [13.436187152293515]
This paper explores generating Clinical QA data using large language models (LLMs) in a zero-shot setting. We find that naive prompting often results in easy questions that do not reflect the complexity of clinical scenarios. Experiments on two Clinical QA datasets demonstrate that our method generates more challenging questions, significantly improving fine-tuning performance over baselines.
arXiv Detail & Related papers (2024-12-05T19:35:41Z)
Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning. We present a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient. We show that rejection sampling by itself can improve performance further without an additional evaluator.
arXiv Detail & Related papers (2024-02-17T11:25:26Z)
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence [10.367359022491181]
We provide rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient and interpretable QA evaluation that is more stable than exact match and neural methods (BERTScore).
arXiv Detail & Related papers (2024-02-17T01:56:19Z)
MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering [64.6741991162092]
We present MinPrompt, a minimal data augmentation framework for open-domain question answering. We transform the raw text into a graph structure to build connections between different factual sentences. We then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text. We generate QA pairs based on the identified sentence subset and train the model on the selected sentences to obtain the final model.
arXiv Detail & Related papers (2023-10-08T04:44:36Z)
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation). We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
Double Retrieval and Ranking for Accurate Question Answering [120.69820139008138]
We show that an answer verification step introduced in Transformer-based answer selection models can significantly improve the state of the art in Question Answering. The results on three well-known datasets for AS2 show consistent and significant improvement of the state of the art.
arXiv Detail & Related papers (2022-01-16T06:20:07Z)
MS-Ranker: Accumulating Evidence from Potentially Correct Candidates for Answer Selection [59.95429407899612]
We propose a novel reinforcement learning based multi-step ranking model, named MS-Ranker. We explicitly consider the potential correctness of candidates and update the evidence with a gating mechanism. Our model significantly outperforms existing methods that do not rely on external resources.
arXiv Detail & Related papers (2020-10-10T10:36:58Z)
Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA). First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA). Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
A Study on Efficiency, Accuracy and Document Structure for Answer Sentence Selection [112.0514737686492]
In this paper, we argue that by exploiting the intrinsic structure of the original rank together with an effective word-relatedness encoder, we can achieve competitive results. Our model takes 9.5 seconds to train on the WikiQA dataset, i.e., very fast in comparison with the $\sim 18$ minutes required by a standard BERT-base fine-tuning.
arXiv Detail & Related papers (2020-03-04T22:12:18Z)
Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU). We show that the error rates of off-the-shelf ASR and the LU systems that follow it can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.