AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering
- URL: http://arxiv.org/abs/2508.20047v3
- Date: Mon, 15 Sep 2025 05:11:57 GMT
- Title: AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering
- Authors: Hassan Alhuzali, Walid Al-Eisawi, Muhammad Abdul-Mageed, Chaimae Abouzahir, Mouath Abu-Daoud, Ashwag Alasmari, Renad Al-Monef, Ali Alqahtani, Lama Ayash, Leen Kharouf, Farah E. Shamout, Nizar Habash
- Abstract summary: AraHealthQA 2025, the Comprehensive Arabic Health Question Answering Shared Task, was held in conjunction with ArabicNLP 2025 (co-located with EMNLP 2025). It offers two complementary tracks: MentalQA, focusing on Arabic mental health Q&A (e.g., anxiety, depression, stigma reduction), and MedArabiQ, covering broader medical domains such as internal medicine, pediatrics, and clinical decision making. We outline the dataset creation, task design and evaluation framework, participation statistics, baseline systems, and summarize the overall outcomes.
- Score: 23.830127107611744
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce AraHealthQA 2025, the Comprehensive Arabic Health Question Answering Shared Task, held in conjunction with ArabicNLP 2025 (co-located with EMNLP 2025). This shared task addresses the paucity of high-quality Arabic medical QA resources by offering two complementary tracks: MentalQA, focusing on Arabic mental health Q&A (e.g., anxiety, depression, stigma reduction), and MedArabiQ, covering broader medical domains such as internal medicine, pediatrics, and clinical decision making. Each track comprises multiple subtasks, evaluation datasets, and standardized metrics, facilitating fair benchmarking. The task was structured to promote modeling under realistic, multilingual, and culturally nuanced healthcare contexts. We outline the dataset creation, task design and evaluation framework, participation statistics, baseline systems, and summarize the overall outcomes. We conclude with reflections on the performance trends observed and prospects for future iterations in Arabic health QA.
Related papers
- !MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning [0.0]
We present our systems for Track 2 (General Arabic Health QA, MedArabiQ) of the AraHealthQA-2025 shared task. Our methodology secured 2nd place in both Sub-Task 1 (multiple-choice question answering) and Sub-Task 2 (open-ended question answering) in Arabic clinical contexts.
arXiv Detail & Related papers (2025-09-14T17:39:58Z)
- MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian. We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z)
- Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks [1.3521447196536418]
This research examines the degree to which state-of-the-art large language models demonstrate and articulate healthcare knowledge in Arabic. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge in the MedArabiQ2025 track. Our results reveal significant variations in correct answer prediction accuracy and low variations in semantic alignment of generated answers.
arXiv Detail & Related papers (2025-08-13T10:41:17Z)
- PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language [0.1747623282473278]
PerMedCQA is the first Persian-language benchmark for evaluating large language models for medical consumer question answering. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel evaluation framework driven by an LLM grader. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems.
arXiv Detail & Related papers (2025-05-23T19:39:01Z)
- MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks [8.379270814399431]
This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation.
arXiv Detail & Related papers (2025-05-06T11:07:26Z)
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- HealthQ: Unveiling Questioning Capabilities of LLM Chains in Healthcare Conversations [20.31796453890812]
HealthQ is a framework for evaluating the questioning capabilities of large language models (LLMs) in healthcare conversations. We integrate an LLM judge to evaluate generated questions across metrics such as specificity, relevance, and usefulness. We present the first systematic framework for assessing questioning capabilities in healthcare conversations, establish a model-agnostic evaluation methodology, and provide empirical evidence linking high-quality questions to improved patient information elicitation.
arXiv Detail & Related papers (2024-09-28T23:59:46Z)
- MentalQA: An Annotated Arabic Corpus for Questions and Answers of Mental Healthcare [0.1638581561083717]
MentalQA is a novel Arabic dataset featuring conversational-style question-and-answer (QA) interactions.
Data was collected from a question-answering medical platform.
MentalQA offers a valuable foundation for developing Arabic text mining tools capable of supporting mental health professionals and individuals seeking information.
arXiv Detail & Related papers (2024-05-21T09:16:38Z)
- From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic. We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- BiMediX: Bilingual Medical Mixture of Experts LLM [90.3257333861513]
We introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations.
arXiv Detail & Related papers (2024-02-20T18:59:26Z)
- PULSAR: Pre-training with Extracted Healthcare Terms for Summarising Patients' Problems and Data Augmentation with Black-box Large Language Models [25.363775123262307]
Automatic summarisation of a patient's problems in the form of a problem list can aid stakeholders in understanding a patient's condition, reducing workload and cognitive bias.
BioNLP 2023 Shared Task 1A focuses on generating a list of diagnoses and problems from the provider's progress notes during hospitalisation.
One component employs large language models (LLMs) for data augmentation; the other is an abstractive summarisation LLM with a novel pre-training objective for generating the patients' problems summarised as a list.
Our approach was ranked second among all submissions to the shared task.
arXiv Detail & Related papers (2023-06-05T10:17:50Z)
- PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)
- ViMQ: A Vietnamese Medical Question Dataset for Healthcare Dialogue System Development [1.4315915057750197]
We publish a Vietnamese dataset of medical questions from patients with sentence-level and entity-level annotations.
We propose a simple self-supervised training strategy with span-noise modelling that improves the performance.
arXiv Detail & Related papers (2023-04-27T17:59:53Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)