New Arabic Medical Dataset for Diseases Classification
- URL: http://arxiv.org/abs/2106.15236v2
- Date: Wed, 30 Jun 2021 10:45:54 GMT
- Title: New Arabic Medical Dataset for Diseases Classification
- Authors: Jaafar Hammoud, Aleksandra Vatian, Natalia Dobrenko, Nikolai
Vedernikov, Anatoly Shalyto, Natalia Gusarova
- Abstract summary: We introduce a new Arab medical dataset, which includes two thousand medical documents collected from several Arabic medical websites.
The dataset was built for the task of classifying texts and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver and Nephrological)
Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, Arabert that based on BERT with large Arabic corpus, and AraBioNER that based on Arabert with Arabic medical corpus.
- Score: 55.41644538483948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Arabic language suffers from a great shortage of datasets suitable for
training deep learning models, and the existing ones include general
non-specialized classifications. In this work, we introduce a new Arab medical
dataset, which includes two thousand medical documents collected from several
Arabic medical websites, in addition to the Arab Medical Encyclopedia. The
dataset was built for the task of classifying texts and includes 10 classes
(Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune,
Liver and Nephrological) diseases. Experiments on the dataset were performed by
fine-tuning three pre-trained models: BERT from Google, Arabert that based on
BERT with large Arabic corpus, and AraBioNER that based on Arabert with Arabic
medical corpus.
Related papers
- AlcLaM: Arabic Dialectal Language Model [2.8477895544986955]
We construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms.
We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch.
Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models.
arXiv Detail & Related papers (2024-07-18T02:13:50Z) - Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition [0.0]
A few research studies have studied the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY)
We aim to mitigate the problem of template translation that occurred in the Egyptian Arabic Wikipedia by identifying these template-translated articles and their characteristics.
arXiv Detail & Related papers (2024-03-31T05:14:38Z) - BiMediX: Bilingual Medical Mixture of Experts LLM [94.85518237963535]
We introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic.
Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details.
We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations.
arXiv Detail & Related papers (2024-02-20T18:59:26Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - RuBioRoBERTa: a pre-trained biomedical language model for Russian
language biomedical text mining [117.56261821197741]
We present several BERT-based models for Russian language biomedical text mining.
The models are pre-trained on a corpus of freely available texts in the Russian biomedical domain.
arXiv Detail & Related papers (2022-04-08T09:18:59Z) - A Deep CNN Architecture with Novel Pooling Layer Applied to Two Sudanese
Arabic Sentiment Datasets [1.1034493405536276]
Two new publicly available datasets are introduced, the 2-Class Sudanese Sentiment dataset and the 3-Class Sudanese Sentiment dataset.
A CNN architecture, SCM, is proposed, comprising five CNN layers together with a novel pooling layer, MMA, to extract the best features.
The proposed model is applied to the existing Saudi Sentiment dataset and to the MSA Hotel Arabic Review dataset with accuracies 85.55% and 90.01%.
arXiv Detail & Related papers (2022-01-29T21:33:28Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z) - A Benchmark Arabic Dataset for Commonsense Explanation [0.6091702876917281]
This paper presents a benchmark Arabic dataset for commonsense explanation.
The dataset consists of Arabic sentences that does not make sense along with three choices to select among them the one that explains why the sentence is false.
arXiv Detail & Related papers (2020-12-18T14:07:10Z) - Predicting Clinical Diagnosis from Patients Electronic Health Records
Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in medical community.
We present a modification of Bidirectional Representations from Transformers (BERT) model for classification sequence.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z) - AraDIC: Arabic Document Classification using Image-Based Character
Embeddings and Class-Balanced Loss [7.734726150561088]
We propose a novel end-to-end Arabic document classification framework, Arabic document image-based classifier (AraDIC)
AraDIC consists of an image-based character encoder and a classifier. They are trained in an end-to-end fashion using the class balanced loss to deal with the long-tailed data distribution problem.
To the best of our knowledge, this is the first image-based character embedding framework addressing the problem of Arabic text classification.
arXiv Detail & Related papers (2020-06-20T14:25:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.