Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
- URL: http://arxiv.org/abs/2509.09254v1
- Date: Thu, 11 Sep 2025 08:39:08 GMT
- Title: Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
- Authors: Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung
- Abstract summary: We introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. We present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We also propose OralGPT, which conducts supervised fine-tuning upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset.
- Score: 16.403842140593706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.
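The abstract reports a single overall accuracy on MMOral-Bench (e.g., 41.45% for GPT-4o) alongside five diagnostic dimensions. A minimal sketch of how per-dimension and overall accuracy might be aggregated for such a benchmark follows; the field names and dimension labels are illustrative assumptions, not the official MMOral-Bench schema.

```python
from collections import defaultdict

def score_benchmark(records):
    """Compute overall and per-dimension exact-match accuracy.

    Each record is a dict with 'dimension', 'prediction', and 'answer'
    keys. These field names are assumptions for illustration, not the
    official MMOral-Bench data format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["dimension"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["dimension"]] += 1
    per_dim = {d: correct[d] / totals[d] for d in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_dim

# Hypothetical mini-evaluation with two of the five dimensions.
records = [
    {"dimension": "pathology", "prediction": "A", "answer": "A"},
    {"dimension": "pathology", "prediction": "B", "answer": "C"},
    {"dimension": "anatomy",   "prediction": "D", "answer": "D"},
]
overall, per_dim = score_benchmark(records)
print(round(overall, 4))               # 0.6667
print(round(per_dim["pathology"], 2))  # 0.5
```

Exact-match scoring of this kind fits the benchmark's multiple-choice-style accuracy reporting; free-text tasks such as report generation would require a different metric.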
Related papers
- DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry [28.389946455559713]
Current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details. We present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning.
arXiv Detail & Related papers (2025-12-12T13:42:57Z)
- OralGPT-Omni: A Versatile Dental Multimodal Large Language Model [44.919874082284686]
We present OralGPT-Omni, the first dental-specialized MLLM for comprehensive analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists' diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis.
arXiv Detail & Related papers (2025-11-27T03:21:20Z)
- Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology [22.124686092997717]
DentVFM is the first family of vision foundation models (VFMs) designed for dentistry. It generates task-agnostic visual representations for a wide range of dental applications and exhibits generalist intelligence, with robust generalization to diverse dental tasks.
arXiv Detail & Related papers (2025-10-16T10:24:23Z)
- A Multi-Stage Fine-Tuning and Ensembling Strategy for Pancreatic Tumor Segmentation in Diagnostic and Therapeutic MRI [7.8413564248632825]
This paper details our submission to the PANTHER challenge, addressing both diagnostic T1-weighted (Task 1) and therapeutic T2-weighted (Task 2) MRI. Our approach is built upon the nnU-Net framework and leverages a deep, multi-stage cascaded pre-training strategy. Our analysis revealed a critical trade-off, in which aggressive data augmentation produced the highest volumetric accuracy.
arXiv Detail & Related papers (2025-08-29T16:50:29Z)
- Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training [53.77904429789069]
We present Attention-TNet, a novel Dual-View Co-Training network for accurate dental caries detection. Attention-TNet first employs automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. To effectively integrate information from both views, we introduce a Gated Cross-View module.
arXiv Detail & Related papers (2025-08-28T14:13:26Z)
- DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding [18.678007079687706]
We introduce DentalBench, the first comprehensive benchmark designed to evaluate and advance large language models (LLMs) in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation.
arXiv Detail & Related papers (2025-08-28T04:35:51Z)
- AMRG: Extend Vision Language Models for Automatic Mammography Report Generation [4.366802575084445]
Mammography report generation is a critical yet underexplored task in medical AI. We introduce AMRG, the first end-to-end framework for generating narrative mammography reports. We train and evaluate AMRG on DMID, a publicly available dataset of paired high-resolution mammograms and diagnostic reports.
arXiv Detail & Related papers (2025-08-12T06:37:41Z)
- EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis [62.00431604976949]
EndoBench is the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs. Our experiments reveal that proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts.
arXiv Detail & Related papers (2025-05-29T16:14:34Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study exhaustively evaluated Gemini, GPT-4, and four other popular large models across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
- OralBBNet: Spatially Guided Dental Segmentation of Panoramic X-Rays with Bounding Box Priors [34.82692226532414]
OralBBNet is designed to improve the accuracy and robustness of tooth classification and segmentation on panoramic X-rays. Our approach achieved a 1-3% improvement in mean average precision (mAP) for tooth detection compared to existing techniques. Results of this study establish a foundation for the wider implementation of object detection models in dental diagnostics.
arXiv Detail & Related papers (2024-06-06T04:57:29Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.