A benchmark multimodal oro-dental dataset for large vision-language models
- URL: http://arxiv.org/abs/2511.04948v1
- Date: Fri, 07 Nov 2025 03:14:20 GMT
- Title: A benchmark multimodal oro-dental dataset for large vision-language models
- Authors: Haoxin Lv, Ijazul Haq, Jin Du, Jiaxin Ma, Binnian Zhu, Xiaobing Dang, Chaoan Liang, Ruxu Du, Yingjie Zhang, Muhammad Saqib
- Abstract summary: The dataset includes 50000 intraoral images, 8056 radiographs, and detailed textual records, including diagnoses, treatment plans, and follow-up notes. We fine-tuned state-of-the-art large vision-language models, Qwen-VL 3B and 7B, and evaluated them on two tasks: classification of six oro-dental anomalies and generation of complete diagnostic reports. The dataset is publicly available, providing an essential resource for future research in AI dentistry.
- Score: 5.063576567382722
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The advancement of artificial intelligence in oral healthcare relies on the availability of large-scale multimodal datasets that capture the complexity of clinical practice. In this paper, we present a comprehensive multimodal dataset, comprising 8775 dental checkups from 4800 patients collected over eight years (2018-2025), with patients ranging from 10 to 90 years of age. The dataset includes 50000 intraoral images, 8056 radiographs, and detailed textual records, including diagnoses, treatment plans, and follow-up notes. The data were collected under standard ethical guidelines and annotated for benchmarking. To demonstrate its utility, we fine-tuned state-of-the-art large vision-language models, Qwen-VL 3B and 7B, and evaluated them on two tasks: classification of six oro-dental anomalies and generation of complete diagnostic reports from multimodal inputs. We compared the fine-tuned models with their base counterparts and GPT-4o. The fine-tuned models achieved substantial gains over these baselines, validating the dataset and underscoring its effectiveness in advancing AI-driven oro-dental healthcare solutions. The dataset is publicly available, providing an essential resource for future research in AI dentistry.
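The classification task described above maps a multimodal checkup to one of six anomaly labels. A minimal sketch of what such fine-tuning could look like with the Hugging Face stack follows; the model ID, the label set, and the prompt template are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch (not the authors' released code) of one LoRA fine-tuning
# step for Qwen2.5-VL on an image -> anomaly-label pair. The model ID, the
# six label names, and the prompt template are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumed; the paper says Qwen-VL 3B/7B
ANOMALIES = ["caries", "calculus", "gingivitis",        # hypothetical label set;
             "malocclusion", "missing tooth", "ulcer"]  # the paper names six anomalies

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

def training_step(image: Image.Image, label: str) -> torch.Tensor:
    """One supervised step: teach the model to emit the anomaly label."""
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Which oro-dental anomaly is visible?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": label}]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False)
    batch = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    # For brevity the loss covers all tokens; real training would mask the prompt.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    return out.loss.detach()
```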
Related papers
- DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice [71.62725911420627]
We introduce DentVLM, a vision-language model engineered for expert-level oral disease diagnosis. The model is capable of interpreting seven 2D oral imaging modalities across 36 diagnostic tasks. It surpassed the diagnostic performance of 13 junior dentists on 21 of 36 tasks and exceeded that of 12 senior dentists on 12 of 36 tasks.
arXiv Detail & Related papers (2025-09-27T14:47:37Z)
- Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis [16.403842140593706]
We introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. We present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We also propose OralGPT, which applies supervised fine-tuning to Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset.
arXiv Detail & Related papers (2025-09-11T08:39:08Z)
- BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature [73.39593644054865]
BIOMEDICA is a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. BMCA-CLIP is a suite of CLIP-style models continuously pretrained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally.
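For context, the CLIP-style objective behind such models is a symmetric contrastive loss over image-caption pairs. The sketch below illustrates it together with Hugging Face streaming; the dataset ID is a placeholder assumption and the encoders are elided.

```python
# Illustrative sketch (not the BMCA-CLIP code): streaming the corpus and the
# symmetric InfoNCE loss that CLIP-style pretraining minimizes. The dataset
# ID below is a placeholder assumption.
import torch
import torch.nn.functional as F
from datasets import load_dataset

# streaming=True iterates over examples without materializing ~27 TB locally
stream = load_dataset("BIOMEDICA/biomedica", split="train", streaming=True)

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Matched image-caption pairs are positives; all others are negatives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature               # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +         # image -> text
            F.cross_entropy(logits.t(), targets)) / 2  # text -> image
```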
arXiv Detail & Related papers (2025-01-13T09:58:03Z)
- How Well Can Modern LLMs Act as Agent Cores in Radiology Environments? [54.36730060680139]
RadA-BenchPlat is an evaluation platform that benchmarks the performance of large language models (LLMs) in radiology environments. The platform also defines ten categories of tools for agent-driven task solving and evaluates seven leading LLMs.
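The abstract does not specify the tool interface, but the agent-core pattern being benchmarked, an LLM choosing a named tool and the environment dispatching it, can be sketched as follows; the tool names and signatures are hypothetical.

```python
# Illustrative sketch of agent-style tool dispatch; tool names and signatures
# here are hypothetical, not RadA-BenchPlat's actual interface.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str):
    """Register a callable so the agent core can route to it by name."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("report_lookup")
def report_lookup(query: str) -> str:
    return f"prior report matching '{query}'"  # stub backend for the demo

def dispatch(tool_name: str, argument: str) -> str:
    """Execute the (tool, argument) decision an LLM agent core emitted."""
    if tool_name not in TOOLS:
        return f"error: unknown tool '{tool_name}'"  # fed back to the agent
    return TOOLS[tool_name](argument)

print(dispatch("report_lookup", "chest CT follow-up"))
```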
arXiv Detail & Related papers (2024-12-12T18:20:16Z)
- SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation [13.672776832197918]
Multimodal large language models (MLLMs) have made significant strides, yet they face challenges in the medical domain due to limited specialized knowledge.
We seek to address this gap at various stages of the end-to-end learning pipeline, including data collection, model fine-tuning, and evaluation.
arXiv Detail & Related papers (2024-10-19T02:35:35Z)
- PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology [7.87900104748629]
We meticulously compiled a dataset of approximately 45,000 cases covering more than six distinct tasks.
We fine-tuned multimodal large models, specifically LLaVA, Qwen-VL, and InternLM, on this dataset to enhance instruction-based performance.
arXiv Detail & Related papers (2024-08-13T17:05:06Z)
- MGH Radiology Llama: A Llama 3 70B Model for Radiology [50.42811030970618]
This paper presents an advanced radiology-focused large language model: MGH Radiology Llama. It is developed using the Llama 3 70B model, building upon previous domain-specific models like Radiology-GPT and Radiology-Llama2. Our evaluation, incorporating both traditional metrics and a GPT-4-based assessment, highlights the enhanced performance of this work over general-purpose LLMs.
arXiv Detail & Related papers (2024-08-13T01:30:03Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple perceptual granularities.
It is constructed from 284 datasets spanning 38 medical image modalities, 18 clinically related tasks, 18 departments, and 4 perceptual granularities, all in a Visual Question Answering (VQA) format.
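A VQA-format benchmark of this kind is typically scored as multiple-choice accuracy per task. The sketch below shows one plausible harness; it is my illustration, `model_answer` stands in for any LVLM inference call, and the item field names are assumptions.

```python
# One plausible scoring harness for a multiple-choice VQA benchmark (my
# illustration; GMAI-MMBench's own harness may differ).
from collections import defaultdict
from typing import Callable, Iterable

def score_vqa(items: Iterable[dict], model_answer: Callable) -> dict[str, float]:
    """items: dicts with 'image', 'question', 'options', 'answer' (e.g. 'B'),
    and 'task'; returns per-task accuracy."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for ex in items:
        pred = model_answer(ex["image"], ex["question"], ex["options"])
        totals[ex["task"]] += 1
        # Count a hit when the reply starts with the gold option letter.
        hits[ex["task"]] += int(pred.strip().upper().startswith(ex["answer"]))
    return {task: hits[task] / totals[task] for task in totals}
```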
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
LLaVA-Rad inference is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
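CheXprompt's exact prompt and rubric live in the paper, but the general GPT-4-as-judge pattern it instantiates can be sketched as follows; the prompt wording and output format below are my own illustration.

```python
# Sketch of the general GPT-4-as-judge pattern CheXprompt instantiates; the
# prompt wording and error-count output format are my own illustration,
# not the paper's actual rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def count_factual_errors(reference: str, candidate: str) -> str:
    prompt = (
        "You are an experienced radiologist. Compare the candidate report "
        "against the reference report and reply with the number of factual "
        f"errors.\nReference: {reference}\nCandidate: {candidate}\nErrors:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return resp.choices[0].message.content
```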
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
- Integrated multimodal artificial intelligence framework for healthcare applications [3.6222901399459215]
We propose and evaluate a unified Holistic AI in Medicine framework to facilitate the generation and testing of AI systems that leverage multimodal inputs.
Our approach uses generalizable data pre-processing and machine learning modeling stages that can be readily adapted for research and deployment in healthcare environments.
We show that this framework can consistently and robustly produce models that outperform similar single-source approaches across various healthcare demonstrations.
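The abstract describes reusable pre-processing and modeling stages rather than a specific API; the sketch below shows the general pattern with scikit-learn, where column names and the estimator choice are placeholders.

```python
# A minimal sketch of the reusable pre-processing + modeling pattern (my
# illustration of the idea, not the HAIM framework's API; column names and
# the estimator are placeholders).
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("numeric", Pipeline([("impute", SimpleImputer()),
                          ("scale", StandardScaler())]), ["age", "lab_value"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["modality", "sex"]),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train): any stage can be swapped without touching the rest.
```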
arXiv Detail & Related papers (2022-02-25T22:08:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.