ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling
- URL: http://arxiv.org/abs/2507.15275v1
- Date: Mon, 21 Jul 2025 06:23:16 GMT
- Title: ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling
- Authors: Yuanhe Tian, Junjie Liu, Zhizhou Kou, Yuxiang Li, Yan Song
- Abstract summary: Existing Chinese medical datasets are limited in size and narrow in domain coverage. ChiMed 2.0 contains 204.4M Chinese characters covering both traditional Chinese medicine classics and modern general medical data.
- Score: 18.816065236545615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building high-quality data resources is crucial for advancing artificial intelligence research and applications in specific domains, particularly in the Chinese medical domain. Existing Chinese medical datasets are limited in size and narrow in domain coverage, falling short of the diverse corpora required for effective pre-training. Moreover, most datasets are designed solely for LLM fine-tuning and do not support pre-training and reinforcement learning from human feedback (RLHF). In this paper, we propose a Chinese medical dataset named ChiMed 2.0, which extends our previous work ChiMed, and covers data collected from Chinese medical online platforms and generated by LLMs. ChiMed 2.0 contains 204.4M Chinese characters covering both traditional Chinese medicine classics and modern general medical data, where there are 164.8K documents for pre-training, 351.6K question-answering pairs for supervised fine-tuning (SFT), and 41.7K preference data tuples for RLHF. To validate the effectiveness of our approach for training a Chinese medical LLM, we conduct further pre-training, SFT, and RLHF experiments on representative general domain LLMs and evaluate their performance on medical benchmark datasets. The results show performance gains across different model scales, validating the dataset's effectiveness and applicability.
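The abstract describes three kinds of training data: raw documents for continued pre-training, question-answering pairs for SFT, and preference tuples for RLHF. As a rough illustration only, the following Python sketch shows one plausible way such records could be represented and counted by training stage; the record types and field names (`text`, `question`, `answer`, `prompt`, `chosen`, `rejected`) are assumptions for exposition, not the released ChiMed 2.0 schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record types mirroring the three training stages described
# in the abstract; the actual ChiMed 2.0 file format may differ.

@dataclass
class PretrainDoc:
    text: str      # raw Chinese medical text (e.g., a TCM classic passage)

@dataclass
class SFTPair:
    question: str  # patient- or exam-style medical question
    answer: str    # reference answer used for supervised fine-tuning

@dataclass
class PreferenceTuple:
    prompt: str    # medical query shown to the model
    chosen: str    # preferred response
    rejected: str  # dispreferred response used as the negative signal in RLHF

def summarize(pretrain: List[PretrainDoc],
              sft: List[SFTPair],
              rlhf: List[PreferenceTuple]) -> str:
    """Report split sizes in the units the paper uses (documents, pairs, tuples)."""
    return (f"{len(pretrain)} pre-training documents, "
            f"{len(sft)} SFT QA pairs, "
            f"{len(rlhf)} RLHF preference tuples")

if __name__ == "__main__":
    # Toy placeholder records only; not taken from the dataset.
    pretrain = [PretrainDoc(text="...")]
    sft = [SFTPair(question="...", answer="...")]
    rlhf = [PreferenceTuple(prompt="...", chosen="...", rejected="...")]
    print(summarize(pretrain, sft, rlhf))
```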
Related papers
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
- MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks [7.822971505079421]
This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation.
arXiv Detail & Related papers (2025-05-06T11:07:26Z)
- FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training [20.259483872569987]
FineMedLM-o1 is a medical large language model with deep reasoning capabilities. We introduce Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. The project and data will be released on GitHub.
arXiv Detail & Related papers (2025-01-16T00:19:19Z)
- STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data.
We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z)
- Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models [8.252044870864523]
We propose Aquila-Med, a bilingual medical LLM based on Aquila.
We construct a large-scale Chinese and English medical dataset for continued pre-training and a high-quality SFT dataset.
Aquila-Med achieves notable results across single-turn dialogues, multi-turn dialogues, and medical multiple-choice questions.
arXiv Detail & Related papers (2024-06-18T01:30:07Z)
- Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams [32.77551245372691]
Existing benchmarks for evaluating Large Language Models (LLMs) in healthcare predominantly focus on medical doctors.
We introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese.
EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists.
arXiv Detail & Related papers (2024-06-17T08:40:36Z)
- MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components, including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which yield several key findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z)
- HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs [61.41790586411816]
HuatuoGPT-II has shown state-of-the-art performance in the Chinese medical domain on a number of benchmarks.
It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine.
arXiv Detail & Related papers (2023-11-16T10:56:24Z)
- ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for the Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases by prompting ChiMed-GPT to complete attitude scales regarding discrimination against patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z)
- Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue [4.558040877516838]
We introduce Zhongjing, the first Chinese medical Large Language Model (LLM) that implements an entire training pipeline from continuous pre-training and SFT to Reinforcement Learning from Human Feedback (RLHF).
We construct a Chinese multi-turn medical dialogue dataset of 70,000 authentic doctor-patient dialogues, CMtMedQA, which significantly enhances the model's capability for complex dialogue and proactive inquiry initiation.
arXiv Detail & Related papers (2023-08-07T12:56:13Z)
- PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)