!MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning
- URL: http://arxiv.org/abs/2509.11365v1
- Date: Sun, 14 Sep 2025 17:39:58 GMT
- Title: !MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning
- Authors: Mohamed Tarek, Seif Ahmed, Mohamed Basem,
- Abstract summary: We present our systems for Track 2 (General Arabic Health QA, MedArabiQ) of the AraHealthQA-2025 shared task.<n>Our methodology secured 2nd place in both Sub-Task 1 (multiple-choice question answering) and Sub-Task 2 (open-ended question answering) in Arabic clinical contexts.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present our systems for Track 2 (General Arabic Health QA, MedArabiQ) of the AraHealthQA-2025 shared task, where our methodology secured 2nd place in both Sub-Task 1 (multiple-choice question answering) and Sub-Task 2 (open-ended question answering) in Arabic clinical contexts. For Sub-Task 1, we leverage the Gemini 2.5 Flash model with few-shot prompting, dataset preprocessing, and an ensemble of three prompt configurations to improve classification accuracy on standard, biased, and fill-in-the-blank questions. For Sub-Task 2, we employ a unified prompt with the same model, incorporating role-playing as an Arabic medical expert, few-shot examples, and post-processing to generate concise responses across fill-in-the-blank, patient-doctor Q&A, GEC, and paraphrased variants.
Related papers
- AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering [23.830127107611744]
AraHealthQA 2025, the Comprehensive Arabic Health Question Answering Shared Task, held in conjunction with ArabicNLP 2025 (co-located with EMNLP 2025)<n>MentalQA, focusing on Arabic mental health Q&A (e.g., anxiety, depression, stigma reduction), and MedArabiQ, covering broader medical domains such as internal medicine, pediatrics, and clinical decision making.<n>We outline the dataset creation, task design and evaluation framework, participation statistics, baseline systems, and summarize the overall outcomes.
arXiv Detail & Related papers (2025-08-27T16:54:09Z) - MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian.<n>We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z) - Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks [1.3521447196536418]
This research examines the degree to which state-of-the-art large language models demonstrate and articulate healthcare knowledge in Arabic.<n>We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge in MedArabiQ2025 track.<n>Our results reveal significant variations in correct answer prediction accuracy and low variations in semantic alignment of generated answers.
arXiv Detail & Related papers (2025-08-13T10:41:17Z) - Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning [51.203811759364925]
mKGQAgent breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks.<n> Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the other participants.
arXiv Detail & Related papers (2025-07-22T19:23:03Z) - MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection [0.0]
This paper describes our submission for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes.<n>The task involves detecting hallucinated spans in text generated by instruction-tuned Large Language Models (LLMs) across multiple languages.<n>Our system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French.
arXiv Detail & Related papers (2025-05-27T08:26:17Z) - GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human [71.42669028683741]
We present a shared task on binary machine generated text detection conducted as a part of the GenAI workshop at COLING 2025.<n>The task consists of two subtasks: Monolingual (English) and Multilingual.<n>We provide a comprehensive overview of the data, a summary of the results, detailed descriptions of the participating systems, and an in-depth analysis of submissions.
arXiv Detail & Related papers (2025-01-19T11:11:55Z) - From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task.<n>We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic.<n>We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - SummQA at MEDIQA-Chat 2023:In-Context Learning with GPT-4 for Medical
Summarization [5.92318236682798]
We present a novel system for the Dialogue2Note Medical Summarization tasks in the MEDIQA 2023 Shared Task.
Our approach for section-wise summarization (Task A) is a two-stage process of selecting semantically similar dialogues and using the top-k similar dialogues as in-context examples for GPT-4.
For full-note summarization (Task B), we use a similar solution with k=1.
arXiv Detail & Related papers (2023-06-30T03:14:04Z) - Winning the CVPR'2022 AQTC Challenge: A Two-stage Function-centric
Approach [51.424201533529114]
Affordance-centric Question-driven Task Completion for Egocentric Assistant(AQTC) is a novel task which helps AI assistant learn from instructional videos and scripts and guide the user step-by-step.
We deal with the AQTC via a two-stage Function-centric approach, which consists of Question2Function Module to ground the question with the related function and Function2Answer Module to predict the action based on the historical steps.
arXiv Detail & Related papers (2022-06-20T07:02:23Z) - Bridging Cross-Lingual Gaps During Leveraging the Multilingual
Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.