Fugu-MT Paper Translation (Abstract): DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding

Paper summary: DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding

arxiv url: http://arxiv.org/abs/2508.20416v1
Date: Thu, 28 Aug 2025 04:35:51 GMT
Status: Translation complete
Last updated in system: 2025-08-29 18:12:02.011652
Title: DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding
Title (reference translation): DentalBench: Benchmarking and Improving LLMs for Bilingual Dentistry Understanding
Authors: Hengchuan Zhu, Yihuan Xu, Yichen Li, Zijie Meng, Zuozhu Liu
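Abstract summary: We introduce DentalBench, the first comprehensive benchmark designed to evaluate and advance large language models (LLMs) in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation.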
Reference score (independently computed attention score): 18.678007079687706
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields, such as dentistry which require deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and reveal significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B demonstrate that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, and highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.
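Abstract (reference translation): Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have shown strong performance on general medical benchmarks. However, because targeted evaluation resources are lacking, their capabilities in specialized medical fields such as dentistry, which require deeper domain-specific knowledge, remain underexplored. This paper introduces DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench comprises DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields, and DentalCorpus, a large-scale, high-quality corpus of 337.35 million tokens curated for dental domain adaptation that supports both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and find significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B show that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, underscoring the importance of domain-specific benchmarks for developing trustworthy and effective LLMs for healthcare applications.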

Related papers