A Course Shared Task on Evaluating LLM Output for Clinical Questions
- URL: http://arxiv.org/abs/2408.00122v1
- Date: Wed, 31 Jul 2024 19:24:40 GMT
- Title: A Course Shared Task on Evaluating LLM Output for Clinical Questions
- Authors: Yufang Hou, Thy Thy Tran, Doan Nam Long Vu, Yiwen Cao, Kai Li, Lukas Rohde, Iryna Gurevych
- Abstract summary: This paper focuses on evaluating the output of Large Language Models (LLMs) for harmful answers to health-related clinical questions.
We describe the task design considerations and report the feedback we received from the students.
- Score: 49.78601596538669
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents a shared task that we organized at the Foundations of Language Technology (FoLT) course in 2023/2024 at the Technical University of Darmstadt, which focuses on evaluating the output of Large Language Models (LLMs) in generating harmful answers to health-related clinical questions. We describe the task design considerations and report the feedback we received from the students. We expect the task and the findings reported in this paper to be relevant for instructors teaching natural language processing (NLP) and designing course assignments.
Related papers
- Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare.
This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z)
- HealthQ: Unveiling Questioning Capabilities of LLM Chains in Healthcare Conversations [23.09755446991835]
In digital healthcare, large language models (LLMs) have primarily been utilized to enhance question-answering capabilities.
This paper presents HealthQ, a novel framework designed to evaluate the questioning capabilities of LLM healthcare chains.
arXiv Detail & Related papers (2024-09-28T23:59:46Z)
- SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials [13.59675117792588]
We present SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials.
Our contributions include the refined NLI4CT-P dataset (i.e., Natural Language Inference for Clinical Trials - Perturbed).
A total of 106 participants registered for the task, contributing over 1200 individual submissions and 25 system overview papers.
This initiative aims to advance the robustness and applicability of NLI models in healthcare, ensuring safer and more dependable AI assistance in clinical decision-making.
arXiv Detail & Related papers (2024-04-07T13:58:41Z)
- Overview of the PromptCBLUE Shared Task in CHIP2023 [26.56584015791646]
This paper presents an overview of the PromptCBLUE shared task held at the CHIP-2023 Conference.
It provides a good testbed for Chinese open-domain or medical-domain large language models (LLMs) in general medical natural language processing.
This paper describes the tasks, the datasets, evaluation metrics, and the top systems for both tasks.
arXiv Detail & Related papers (2023-12-29T09:05:00Z)
- A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks [7.542019351929903]
We evaluate four state-of-the-art instruction-tuned large language models (LLMs) on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English.
arXiv Detail & Related papers (2023-07-22T15:58:17Z)
- Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A Preliminary Study on Writing Assistance [60.40541387785977]
Small foundation models can display remarkable proficiency in tackling diverse tasks when fine-tuned using instruction-driven data.
In this work, we investigate a practical problem setting where the primary focus is on one or a few particular tasks rather than general-purpose instruction following.
Experimental results show that fine-tuning LLaMA on writing instruction data significantly improves its ability on writing tasks.
arXiv Detail & Related papers (2023-05-22T16:56:44Z)
- Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding [12.128991867050487]
Large language models (LLMs) have made significant progress in various domains, including healthcare.
In this study, we evaluate state-of-the-art LLMs within the realm of clinical language understanding tasks.
arXiv Detail & Related papers (2023-04-09T16:31:47Z)
- Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine [68.7814360102644]
We propose the Re³Writer method with retrieval-augmented generation and knowledge-grounded reasoning.
We demonstrate the effectiveness of our method in generating patient discharge instructions.
arXiv Detail & Related papers (2022-10-23T16:34:39Z)
- ITTC @ TREC 2021 Clinical Trials Track [54.141379782822206]
The task focuses on the problem of matching eligible clinical trials to topics constituting a summary of a patient's admission notes.
We explore different ways of representing trials and topics using NLP techniques, and then use a common retrieval model to generate the ranked list of relevant trials for each topic.
The results from all our submitted runs are well above the median scores for all topics, but there is still plenty of scope for improvement.
arXiv Detail & Related papers (2022-02-16T04:56:47Z)
- Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.