Fugu-MT Paper Translation (Abstract): MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark

Paper summary: MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark

arxiv url: http://arxiv.org/abs/2602.01714v1
Date: Mon, 02 Feb 2026 06:52:20 GMT
Status: Translation complete
System update date: 2026-02-03 19:28:33.959167
Title: MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark
Title (reference translation): MedAraBench: A large-scale Arabic medical question answering dataset and benchmark
Authors: Mouath Abu-Daoud, Leen Kharouf, Omar El Hajj, Dana El Samad, Mariam Al-Omari, Jihad Mallat, Khaled Saleh, Nizar Habash, Farah E. Shamout
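Abstract summary: Arabic is one of the most underrepresented languages in natural language processing research. MedAraBench is a large-scale dataset of Arabic multiple-choice question-answer pairs spanning various medical specialties.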
Reference score (independently computed attention score): 8.428847258506176
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.
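Abstract (reference translation): Arabic is one of the most underrepresented languages in natural language processing research, especially in medical applications, because open-source data and benchmarks are scarce. This lack of resources hinders the evaluation and improvement of the multilingual capabilities of Large Language Models (LLMs). This paper introduces MedAraBench, a large-scale dataset of Arabic multiple-choice question-answer pairs across various medical specialties. The dataset was built by manually digitizing academic materials created by medical professionals in the Arabic-speaking region. It was then extensively preprocessed and split into training and test sets to support future research in the area. Two frameworks were adopted to assess data quality: expert human evaluation and LLM-as-a-judge. The dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking, the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet, was assessed. The findings highlight the need for further domain-specific enhancements. The dataset and evaluation scripts are released to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.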
