DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding
- URL: http://arxiv.org/abs/2508.20416v1
- Date: Thu, 28 Aug 2025 04:35:51 GMT
- Title: DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding
- Authors: Hengchuan Zhu, Yihuan Xu, Yichen Li, Zijie Meng, Zuozhu Liu
- Abstract summary: We introduce DentalBench, the first comprehensive benchmark designed to evaluate and advance large language models (LLMs) in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation.
- Score: 18.678007079687706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields, such as dentistry, which require deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and reveal significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B demonstrate that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, and highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.
Related papers
- DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry [28.389946455559713]
Current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details. We present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning.
arXiv Detail & Related papers (2025-12-12T13:42:57Z) - OralGPT-Omni: A Versatile Dental Multimodal Large Language Model [44.919874082284686]
We present OralGPT-Omni, the first dental-specialized MLLM for comprehensive analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists' diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis.
arXiv Detail & Related papers (2025-11-27T03:21:20Z) - DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice [71.62725911420627]
We introduce DentVLM, a vision-language model engineered for expert-level oral disease diagnosis. The model is capable of interpreting seven 2D oral imaging modalities across 36 diagnostic tasks. It surpassed the diagnostic performance of 13 junior dentists on 21 of 36 tasks and exceeded that of 12 senior dentists on 12 of 36 tasks.
arXiv Detail & Related papers (2025-09-27T14:47:37Z) - Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis [16.403842140593706]
We introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. We present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We also propose OralGPT, which conducts supervised fine-tuning upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset.
arXiv Detail & Related papers (2025-09-11T08:39:08Z) - Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
arXiv Detail & Related papers (2025-08-29T08:58:39Z) - CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed through multiple validation methods.
arXiv Detail & Related papers (2024-10-04T15:15:36Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of several key components, including the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding [12.128991867050487]
Large language models (LLMs) have made significant progress in various domains, including healthcare.
In this study, we evaluate state-of-the-art LLMs within the realm of clinical language understanding tasks.
arXiv Detail & Related papers (2023-04-09T16:31:47Z) - ChatGPT for Shaping the Future of Dentistry: The Potential of Multi-Modal Large Language Model [18.59603757924943]
ChatGPT is a lightweight, conversational variant of Generative Pretrained Transformer 4 (GPT-4) developed by OpenAI.
This paper mainly discusses the future applications of Large Language Models (LLMs) in dentistry.
arXiv Detail & Related papers (2023-03-23T15:34:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.