CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic
- URL: http://arxiv.org/abs/2511.03102v1
- Date: Wed, 05 Nov 2025 01:17:43 GMT
- Title: CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic
- Authors: Saad Mankarious, Ayah Zirikly
- Abstract summary: We present CARMA, the first automatically annotated large-scale dataset of Arabic Reddit posts. The dataset encompasses six mental health conditions, such as Anxiety, Autism, and Depression, and a control group. We conduct qualitative and quantitative analyses of lexical and semantic differences between users, providing insights into the linguistic markers of specific mental health conditions.
- Score: 1.3320917259299652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mental health disorders affect millions worldwide, yet early detection remains a major challenge, particularly for Arabic-speaking populations where resources are limited and mental health discourse is often discouraged due to cultural stigma. While substantial research has focused on English-language mental health detection, Arabic remains significantly underexplored, partly due to the scarcity of annotated datasets. We present CARMA, the first automatically annotated large-scale dataset of Arabic Reddit posts. The dataset encompasses six mental health conditions, such as Anxiety, Autism, and Depression, and a control group. CARMA surpasses existing resources in both scale and diversity. We conduct qualitative and quantitative analyses of lexical and semantic differences between users, providing insights into the linguistic markers of specific mental health conditions. To demonstrate the dataset's potential for further mental health analysis, we perform classification experiments using a range of models, from shallow classifiers to large language models. Our results highlight the promise of advancing mental health detection in underrepresented languages such as Arabic.
Related papers
- MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data [29.110680511845327]
We present a new benchmark dataset, MindSET, curated from Reddit using self-reported diagnoses to address limitations. The annotated dataset contains over 13M annotated posts across seven mental health conditions, more than twice the size of previous benchmarks.
arXiv Detail & Related papers (2025-11-14T16:06:04Z) - A Comprehensive Review of Datasets for Clinical Mental Health AI Systems [55.67299586253951]
We present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data.
arXiv Detail & Related papers (2025-08-13T13:42:35Z) - A Survey on Multilingual Mental Disorders Detection from Social Media Data [19.167802086240293]
We present the first survey on the detection of mental health disorders using multilingual social media data. We investigate the cultural nuances that influence online language patterns and self-disclosure behaviors. We provide a comprehensive list of multilingual data collections that can be used for developing NLP models for mental health screening.
arXiv Detail & Related papers (2025-05-21T14:15:54Z) - MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders [59.515827458631975]
Mental health disorders are among the most serious diseases in the world. Privacy concerns limit the accessibility of personalized treatment data. MentalArena is a self-play framework to train language models.
arXiv Detail & Related papers (2024-10-09T13:06:40Z) - Building Multilingual Datasets for Predicting Mental Health Severity through LLMs: Prospects and Challenges [3.0382033111760585]
Large Language Models (LLMs) are increasingly being integrated into various medical fields, including mental health support systems. We present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages. This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages.
arXiv Detail & Related papers (2024-09-25T22:14:34Z) - MentalQA: An Annotated Arabic Corpus for Questions and Answers of Mental Healthcare [0.1638581561083717]
MentalQA is a novel Arabic dataset featuring conversational-style question-and-answer (QA) interactions.
Data was collected from a question-answering medical platform.
MentalQA offers a valuable foundation for developing Arabic text mining tools capable of supporting mental health professionals and individuals seeking information.
arXiv Detail & Related papers (2024-05-21T09:16:38Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - DEPAC: a Corpus for Depression and Anxiety Detection from Speech [3.2154432166999465]
We introduce a novel mental distress analysis audio dataset DEPAC, labeled based on established thresholds on depression and anxiety screening tools.
This large dataset comprises multiple speech tasks per individual, as well as relevant demographic information.
We present a feature set consisting of hand-curated acoustic and linguistic features, which were found effective in identifying signs of mental illnesses in human speech.
arXiv Detail & Related papers (2023-06-20T12:21:06Z) - COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z) - Learning Language and Multimodal Privacy-Preserving Markers of Mood from Mobile Data [74.60507696087966]
Mental health conditions remain underdiagnosed even in countries with common access to advanced medical care.
One promising data source to help monitor human behavior is daily smartphone usage.
We study behavioral markers of daily mood using a recent dataset of mobile behaviors from adolescent populations at high risk of suicidal behaviors.
arXiv Detail & Related papers (2021-06-24T17:46:03Z) - Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment [50.15466026089435]
We present a novel peer-to-peer Hindi conversation dataset, Vyaktitv.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features, such as income and cultural orientation, among several others, for all the participants.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.