Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation
- URL: http://arxiv.org/abs/2501.06741v1
- Date: Sun, 12 Jan 2025 07:30:49 GMT
- Title: Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation
- Authors: Shunfan Zheng, Xiechi Zhang, Gerard de Melo, Xiaoling Wang, Linlin Wang,
- Abstract summary: HDCEval is built on a set of fine-grained medical evaluation guidelines developed in collaboration with professional doctors.<n>The framework decomposes complex evaluation tasks into specialized subtasks, each evaluated by expert models.<n>This hierarchical approach ensures that each aspect of the evaluation is handled with expert precision, leading to a significant improvement in alignment with human evaluators.
- Score: 31.061600616994145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the rapidly evolving landscape of large language models (LLMs) for medical applications, ensuring the reliability and accuracy of these models in clinical settings is paramount. Existing benchmarks often focus on fixed-format tasks like multiple-choice QA, which fail to capture the complexity of real-world clinical diagnostics. Moreover, traditional evaluation metrics and LLM-based evaluators struggle with misalignment, often providing oversimplified assessments that do not adequately reflect human judgment. To address these challenges, we introduce HDCEval, a Hierarchical Divide-and-Conquer Evaluation framework tailored for fine-grained alignment in medical evaluation. HDCEval is built on a set of fine-grained medical evaluation guidelines developed in collaboration with professional doctors, encompassing Patient Question Relevance, Medical Knowledge Correctness, and Expression. The framework decomposes complex evaluation tasks into specialized subtasks, each evaluated by expert models trained through Attribute-Driven Token Optimization (ADTO) on a meticulously curated preference dataset. This hierarchical approach ensures that each aspect of the evaluation is handled with expert precision, leading to a significant improvement in alignment with human evaluators.
Related papers
- Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.
Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.
We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - A Scalable Framework for Evaluating Health Language Models [16.253655494186905]
Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets.
Current evaluation practices for open-ended text responses heavily rely on human experts.
This work introduces Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions.
arXiv Detail & Related papers (2025-03-30T06:47:57Z) - GPBench: A Comprehensive and Fine-Grained Benchmark for Evaluating Large Language Models as General Practitioners [12.208184074411896]
General practitioners (GPs) serve as the cornerstone of primary healthcare systems by providing continuous and comprehensive medical services.
Due to community-oriented nature of their practice, uneven training and resource gaps, the clinical proficiency among GPs can vary significantly across regions and healthcare settings.
Large Language Models (LLMs) have demonstrated great potential in clinical and medical applications, making them a promising tool for supporting general practice.
To evaluate how effectively LLMs can make decisions in the daily work of GPs, we designed GPBench, which consists of both test questions from clinical practice and a novel evaluation framework.
arXiv Detail & Related papers (2025-03-22T01:02:44Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.
We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.
Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports [0.0]
We introduce a novel question-answering (QA) dataset using echocardiogram reports sourced from the Medical Information Mart for Intensive Care database.
This dataset is specifically designed to enhance QA systems in cardiology, consisting of 771,244 QA pairs addressing a wide array of cardiac abnormalities and their severity.
We compare large language models (LLMs), including open-source and biomedical-specific models for zero-shot evaluation, and closed-source models for zero-shot and three-shot evaluation.
arXiv Detail & Related papers (2025-03-04T07:45:45Z) - MedCoT: Medical Chain of Thought via Hierarchical Expert [48.91966620985221]
This paper presents MedCoT, a novel hierarchical expert verification reasoning chain method.<n>It is designed to enhance interpretability and accuracy in biomedical imaging inquiries.<n> Experimental evaluations on four standard Med-VQA datasets demonstrate that MedCoT surpasses existing state-of-the-art approaches.
arXiv Detail & Related papers (2024-12-18T11:14:02Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs)
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment [22.983780823136925]
This research examines the use of Reinforcement Learning from AI Feedback (RLAIF) techniques to improve healthcare dialogue models.
We argue that the primary challenges in current RLAIF research for healthcare are the limitations of automated evaluation methods.
We present a new evaluation framework based on standardized patient examinations.
arXiv Detail & Related papers (2024-10-05T10:29:19Z) - Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm [15.627870862369784]
Large language models (LLMs) are gaining increasing interests to improve clinical efficiency for medical diagnosis.
We propose an automatic evaluation paradigm tailored to assess the LLMs' capabilities in delivering clinical services.
arXiv Detail & Related papers (2024-03-25T06:17:54Z) - A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [20.11590976578911]
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities.
Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity.
We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions.
arXiv Detail & Related papers (2024-03-18T17:56:37Z) - HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical
Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language models evaluators with human preference.
HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z) - TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic
Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z) - Consultation Checklists: Standardising the Human Evaluation of Medical
Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z) - The Medkit-Learn(ing) Environment: Medical Decision Modelling through
Simulation [81.72197368690031]
We present a new benchmarking suite designed specifically for medical sequential decision making.
The Medkit-Learn(ing) Environment is a publicly available Python package providing simple and easy access to high-fidelity synthetic medical data.
arXiv Detail & Related papers (2021-06-08T10:38:09Z) - Optimizing Medical Treatment for Sepsis in Intensive Care: from
Reinforcement Learning to Pre-Trial Evaluation [2.908482270923597]
Our aim is to establish a framework where reinforcement learning (RL) of optimizing interventions retrospectively allows us a regulatory compliant pathway to prospective clinical testing of the learned policies.
We focus on infections in intensive care units which are one of the major causes of death and difficult to treat because of the complex and opaque patient dynamics.
arXiv Detail & Related papers (2020-03-13T20:31:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.