TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine
- URL: http://arxiv.org/abs/2505.24063v1
- Date: Thu, 29 May 2025 23:13:57 GMT
- Title: TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine
- Authors: Jiacheng Xie, Yang Yu, Ziyang Zhang, Shuai Zeng, Jiaxuan He, Ayush Vasireddy, Xiaoting Tang, Congyu Guo, Lening Zhao, Congcong Jing, Guanghui An, Dong Xu,
- Abstract summary: We introduce TCM-Ladder, the first multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset spans multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. The datasets were constructed using a combination of automated and manual filtering processes and comprise 52,000+ questions in total.
- Score: 21.46828174190836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tailored for TCM has underscored the need for an objective and comprehensive evaluation framework to assess their performance on real-world tasks. However, existing evaluation datasets are limited in scope and primarily text-based, lacking a unified and standardized multimodal question-answering (QA) benchmark. To address this issue, we introduce TCM-Ladder, the first multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset spans multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. In addition to textual content, TCM-Ladder incorporates various modalities such as images and videos. The datasets were constructed using a combination of automated and manual filtering processes and comprise 52,000+ questions in total. These questions include single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks. We trained a reasoning model on TCM-Ladder and conducted comparative experiments against 9 state-of-the-art general domain and 5 leading TCM-specific LLMs to evaluate their performance on the datasets. Moreover, we propose Ladder-Score, an evaluation method specifically designed for TCM question answering that effectively assesses answer quality regarding terminology usage and semantic expression. To our knowledge, this is the first work to evaluate mainstream general domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at https://tcmladder.com or https://54.211.107.106 and will be continuously updated.
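For illustration only, the sketch below shows how a TCM-Ladder-style QA item and a simple exact-match accuracy loop over choice questions might look; the field names, the `TCMQAItem` class, and the `model_fn` callable are hypothetical placeholders, not the dataset's actual schema or the paper's Ladder-Score metric.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class TCMQAItem:
    """Hypothetical representation of one benchmark question."""
    question: str
    options: List[str]            # empty for fill-in-the-blank / dialogue items
    answer: str                   # gold answer (option label or free text)
    modality: str = "text"        # e.g. "text", "image", "video"
    discipline: str = "general"   # e.g. "diagnostics", "herbal formulas"
    image_path: Optional[str] = None

def evaluate_choice_accuracy(items: List[TCMQAItem],
                             model_fn: Callable[[TCMQAItem], str]) -> float:
    """Exact-match accuracy over the single/multiple-choice subset only."""
    scored = [it for it in items if it.options]
    if not scored:
        return 0.0
    correct = sum(model_fn(it).strip() == it.answer.strip() for it in scored)
    return correct / len(scored)

if __name__ == "__main__":
    demo = [TCMQAItem(
        question="Which organ governs the storage of blood in TCM theory?",
        options=["A. Heart", "B. Liver", "C. Spleen", "D. Kidney"],
        answer="B",
        discipline="fundamental theory",
    )]
    # A trivial stand-in "model" that always answers "B".
    print(evaluate_choice_accuracy(demo, lambda item: "B"))  # 1.0
```

Open-ended formats such as diagnostic dialogue would need a quality metric like the paper's Ladder-Score rather than exact match; that metric is not reproduced here.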
Related papers
- MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine [36.08458917280579]
MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance.
arXiv Detail & Related papers (2025-06-02T02:01:40Z) - TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine [10.74071774496229]
Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) remains underexplored. To address this, we introduce TCM3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. Results show a clear performance hierarchy: all models have limitations in specialized areas such as Meridian & Acupoint theory and the various TCM schools, revealing gaps between current capabilities and clinical needs.
arXiv Detail & Related papers (2025-03-10T08:29:15Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine [19.680694337954133]
Professional evaluation benchmarks for large language models (LLMs) have yet to cover the traditional Chinese medicine (TCM) domain.
To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM.
It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis.
To evaluate LLMs beyond the accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM-related questions.
arXiv Detail & Related papers (2024-06-03T09:11:13Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models [17.643421997037514]
We propose a novel framework that tackles both discriminative and generative multimodal medical tasks.
The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning.
Our model can achieve performance superior to or on par with state-of-the-art baselines.
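As a rough, hypothetical illustration of the mixture-of-experts routing idea mentioned in the Med-MoE summary above (not the paper's actual architecture), here is a minimal sketch of a gating layer that dispatches each input to a small set of experts; the dimensions, expert count, and top-k value are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer for illustration purposes."""
    def __init__(self, d_model: int = 64, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # router producing expert scores
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                           nn.Linear(d_model, d_model)) for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
        logits = self.gate(x)                              # (batch, n_experts)
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                         # combine the k routed experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = TinyMoE()
    print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```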
arXiv Detail & Related papers (2024-04-16T02:35:17Z) - TCM-GPT: Efficient Pre-training of Large Language Models for Domain Adaptation in Traditional Chinese Medicine [11.537289359051975]
We propose TCMDA (TCM Domain Adaptation), an approach for efficient pre-training with a domain-specific corpus.
Specifically, we first construct a large TCM-specific corpus, TCM-Corpus-1B, by identifying domain keywords and retrieving relevant text from a general corpus.
Then, TCMDA leverages LoRA, which freezes the pretrained model's weights and uses rank decomposition matrices to efficiently train specific dense layers for pre-training and fine-tuning.
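For reference, a minimal sketch of the LoRA idea described above (a frozen base weight plus a trainable low-rank update); the layer sizes, rank, and scaling here are arbitrary assumptions, not TCM-GPT's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: h = W0 x + (alpha / r) * B A x."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze the pretrained weights
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # down-projection
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))        # up-projection, zero-init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (x @ self.lora_A.T) @ self.lora_B.T   # low-rank update
        return self.base(x) + self.scaling * delta

if __name__ == "__main__":
    layer = LoRALinear(128, 128)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # only the low-rank matrices A and B are trainable
```

Because only A and B receive gradients, the number of trainable parameters stays small relative to the frozen base layer, which is what makes this style of adaptation efficient.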
arXiv Detail & Related papers (2023-11-03T08:54:50Z) - RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal Sentiment Classification [70.9087014537896]
Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars.
To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets.
arXiv Detail & Related papers (2023-10-14T14:52:37Z) - MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z) - PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.