RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question
Answering and Clinical Reasoning
- URL: http://arxiv.org/abs/2402.14840v1
- Date: Mon, 19 Feb 2024 06:57:02 GMT
- Title: RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question
Answering and Clinical Reasoning
- Authors: Congyun Jin, Ming Zhang, Xiaowei Ma, Li Yujiao, Yingbo Wang, Yabo Jia,
Yuliang Du, Tao Sun, Haowen Wang, Cong Fan, Jinjie Gu, Chenfei Chi, Xiangguo
Lv, Fangzhou Li, Wei Xue, Yiran Huang
- Abstract summary: This work introduces RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization.
- Score: 14.366349078707263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Language Models (LLMs) and Large Multi-modal
Models (LMMs) have shown potential in various medical applications, such as
Intelligent Medical Diagnosis. Although impressive results have been achieved,
we find that existing benchmarks do not reflect the complexity of real medical
reports or the specialized, in-depth reasoning they require. In this work, we
introduce RJUA-MedDQA, a comprehensive benchmark in the field of medical
specialization, which poses several challenges: comprehensively interpreting
image content across diverse, challenging layouts; possessing the numerical
reasoning ability to identify abnormal indicators; and demonstrating the
clinical reasoning ability to provide statements of disease diagnosis, status,
and advice based on medical contexts. We carefully design the data generation
pipeline and propose the Efficient Structural Restoration Annotation (ESRA) method, aimed
at restoring textual and tabular content in medical report images. This method
substantially enhances annotation efficiency, doubling the productivity of each
annotator, and yields a 26.8% improvement in accuracy. We conduct extensive
evaluations, including few-shot assessments of 5 LMMs that are capable of
solving Chinese medical QA tasks. To further investigate the limitations and
potential of current LMMs, we conduct comparative experiments on a set of
strong LLMs using the image-text pairs generated by the ESRA method. We report
the performance of baselines and offer several observations: (1) the overall
performance of existing LMMs is still limited, although LMMs are more robust to
low-quality and diversely structured images than LLMs; (2) reasoning across
context and image content presents significant challenges. We hope this
benchmark helps the community make progress on these challenging tasks in
multi-modal medical document understanding and facilitates its application in
healthcare.
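The abstract credits ESRA with doubling per-annotator productivity and improving annotation accuracy by 26.8%, but it does not describe the implementation. As a minimal sketch, assuming ESRA drafts a structured restoration of each report that human annotators then correct, the step might look like the following; the prompt wording, `restore_report`, `llm_call`, and the JSON schema are all hypothetical, not the authors' code:

```python
# Hypothetical sketch only: the abstract does not publish the ESRA
# implementation, so the prompt, function names, and JSON schema below
# are assumptions, not the authors' code.
import json
from typing import Callable

RESTORE_PROMPT = (
    "Below is raw OCR output from a medical report image. Restore the "
    "original reading order and reconstruct any lab-result tables as a "
    "JSON list of objects with keys: indicator, value, unit, reference_range."
)

def restore_report(ocr_text: str, llm_call: Callable[[str], str]) -> list:
    """Ask a text-only LLM to draft a structured restoration of one report.

    `llm_call` is any callable mapping a prompt string to a completion
    string, e.g. a thin wrapper around a chat-completion API.
    """
    completion = llm_call(f"{RESTORE_PROMPT}\n\nOCR:\n{ocr_text}")
    return json.loads(completion)  # annotators would then correct this draft
```

Having annotators correct a machine-drafted restoration rather than transcribe each report from scratch would be consistent with the reported productivity gain.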
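Similarly, the LMM-vs-LLM comparison described in the abstract could be scored with a loop like the one below. This is not the paper's evaluation harness: `lmm_answer`, `llm_answer`, the dataset field names, and the exact-match metric are all assumptions made for illustration.

```python
# Minimal comparative-evaluation sketch (not the paper's harness): an LMM
# answers from the report image, while a text-only LLM answers from the
# ESRA-restored text. All field names and callables are hypothetical.
from typing import Callable, Iterable

def evaluate(dataset: Iterable[dict],
             lmm_answer: Callable[[bytes, str], str],
             llm_answer: Callable[[str, str], str]) -> dict:
    lmm_hits = llm_hits = total = 0
    for sample in dataset:
        total += 1
        gold = sample["answer"].strip()
        # LMM sees the raw report image; LLM sees restored text only.
        if lmm_answer(sample["image"], sample["question"]).strip() == gold:
            lmm_hits += 1
        if llm_answer(sample["restored_text"], sample["question"]).strip() == gold:
            llm_hits += 1
    return {"lmm_accuracy": lmm_hits / total,
            "llm_accuracy": llm_hits / total}
```

Exact match is a deliberate simplification here; free-text diagnostic statements would realistically require a judge model or expert grading.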
Related papers
- MedExQA: Medical Question Answering Benchmark with Multiple Explanations [2.2246416434538308]
This paper introduces MedExQA, a novel benchmark in medical question-answering to evaluate large language models' (LLMs) understanding of medical knowledge through explanations.
By constructing datasets across five distinct medical specialties, we address a major gap in current medical QA benchmarks.
Our work highlights the importance of explainability in medical LLMs, proposes an effective methodology for evaluating models beyond classification accuracy, and sheds light on one specific domain, speech-language pathology.
arXiv Detail & Related papers (2024-06-10T14:47:04Z)
- MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge [4.8004472307210255]
Large language models (LLMs) have excelled across domains, delivering notable performance on medical evaluation benchmarks.
However, there still exists a significant gap between the reported performance and the practical effectiveness in real-world medical scenarios.
We develop a novel evaluation framework, MultifacetEval, to examine the degree and coverage of LLMs in encoding and mastering medical knowledge.
arXiv Detail & Related papers (2024-06-05T04:15:07Z)
- Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z) - MedLM: Exploring Language Models for Medical Question Answering Systems [2.84801080855027]
Large Language Models (LLMs) with their advanced generative capabilities have shown promise in various NLP tasks.
This study aims to compare the performance of general and medical-specific distilled LMs for medical Q&A.
The findings will provide valuable insights into the suitability of different LMs for specific applications in the medical domain.
arXiv Detail & Related papers (2024-01-21T03:37:47Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - Large Language Models Illuminate a Progressive Pathway to Artificial
Healthcare Assistant: A Review [16.008511195589925]
Large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning.
This paper provides a comprehensive review on the applications and implications of LLMs in medicine.
arXiv Detail & Related papers (2023-11-03T13:51:36Z) - An Automatic Evaluation Framework for Multi-turn Medical Consultations
Capabilities of Large Language Models [22.409334091186995]
Large language models (LLMs) often suffer from hallucinations, leading to overly confident but incorrect judgments.
This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations.
arXiv Detail & Related papers (2023-09-05T09:24:48Z) - Towards Medical Artificial General Intelligence via Knowledge-Enhanced
Multimodal Pretraining [121.89793208683625]
Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks.
We propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR).
arXiv Detail & Related papers (2023-04-26T01:26:19Z) - SPeC: A Soft Prompt-Based Calibration on Performance Variability of
Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization.
Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
arXiv Detail & Related papers (2023-03-23T04:47:46Z)