A Course Shared Task on Evaluating LLM Output for Clinical Questions
- URL: http://arxiv.org/abs/2408.00122v1
- Date: Wed, 31 Jul 2024 19:24:40 GMT
- Title: A Course Shared Task on Evaluating LLM Output for Clinical Questions
- Authors: Yufang Hou, Thy Thy Tran, Doan Nam Long Vu, Yiwen Cao, Kai Li, Lukas Rohde, Iryna Gurevych
- Abstract summary: This paper focuses on evaluating the output of Large Language Models (LLMs) in generating harmful answers to health-related clinical questions.
We describe the task design considerations and report the feedback we received from the students.
- Score: 49.78601596538669
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents a shared task that we organized at the Foundations of Language Technology (FoLT) course in 2023/2024 at the Technical University of Darmstadt, which focuses on evaluating the output of Large Language Models (LLMs) in generating harmful answers to health-related clinical questions. We describe the task design considerations and report the feedback we received from the students. We expect the task and the findings reported in this paper to be relevant for instructors teaching natural language processing (NLP) and designing course assignments.
Related papers
- Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare.
This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z)
- HealthQ: Unveiling Questioning Capabilities of LLM Chains in Healthcare Conversations [23.09755446991835]
In digital healthcare, large language models (LLMs) have primarily been utilized to enhance question-answering capabilities.
This paper presents HealthQ, a novel framework designed to evaluate the questioning capabilities of LLM healthcare chains.
arXiv Detail & Related papers (2024-09-28T23:59:46Z)
- SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials [13.59675117792588]
We present SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials.
Our contributions include the refined NLI4CT-P dataset (i.e., Natural Language Inference for Clinical Trials - Perturbed).
A total of 106 participants registered for the task, contributing over 1200 individual submissions and 25 system overview papers.
This initiative aims to advance the robustness and applicability of NLI models in healthcare, ensuring safer and more dependable AI assistance in clinical decision-making.
arXiv Detail & Related papers (2024-04-07T13:58:41Z)
- Overview of the PromptCBLUE Shared Task in CHIP2023 [26.56584015791646]
This paper presents an overview of the PromptCBLUE shared task held at the CHIP-2023 Conference.
It provides a good testbed for Chinese open-domain or medical-domain large language models (LLMs) in general medical natural language processing.
This paper describes the tasks, the datasets, evaluation metrics, and the top systems for both tasks.
arXiv Detail & Related papers (2023-12-29T09:05:00Z)
- Instruction Tuning for Large Language Models: A Survey [52.86322823501338]
We present a systematic review of the literature, including the general methodology of instruction tuning (IT), the construction of IT datasets, the training of IT models, and applications to different modalities and domains.
We also review the potential pitfalls of IT and the criticism against it, along with efforts pointing out current deficiencies of existing strategies, and suggest some avenues for fruitful research.
arXiv Detail & Related papers (2023-08-21T15:35:16Z)
- A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks [7.542019351929903]
We evaluate four state-of-the-art instruction-tuned large language models (LLMs) on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English.
arXiv Detail & Related papers (2023-07-22T15:58:17Z)
- Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding [12.128991867050487]
Large language models (LLMs) have made significant progress in various domains, including healthcare.
In this study, we evaluate state-of-the-art LLMs within the realm of clinical language understanding tasks.
arXiv Detail & Related papers (2023-04-09T16:31:47Z)
- Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine [68.7814360102644]
We propose the Re$^3$Writer method with retrieval-augmented generation and knowledge-grounded reasoning.
We demonstrate the effectiveness of our method in generating patient discharge instructions.
arXiv Detail & Related papers (2022-10-23T16:34:39Z)
- ITTC @ TREC 2021 Clinical Trials Track [54.141379782822206]
The task focuses on the problem of matching eligible clinical trials to topics constituting a summary of a patient's admission notes.
We explore different ways of representing trials and topics using NLP techniques, and then use a common retrieval model to generate the ranked list of relevant trials for each topic.
The results from all our submitted runs are well above the median scores for all topics, but there is still plenty of scope for improvement.
arXiv Detail & Related papers (2022-02-16T04:56:47Z)
- Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)