belabBERT: a Dutch RoBERTa-based language model applied to psychiatric
classification
- URL: http://arxiv.org/abs/2106.01091v1
- Date: Wed, 2 Jun 2021 11:50:49 GMT
- Title: belabBERT: a Dutch RoBERTa-based language model applied to psychiatric
classification
- Authors: Joppe Wouts, Janna de Boer, Alban Voppel, Sanne Brederoo, Sander van
Splunter and Iris Sommer
- Abstract summary: We present belabBERT, a new Dutch language model extending the RoBERTa architecture.
belabBERT is trained on a large Dutch corpus (over 32 GB) of web-crawled text.
We evaluate the strength of text-based classification using belabBERT and compare the results to the existing RobBERT model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing (NLP) is becoming an important means for
automatic recognition of human traits and states, such as intoxication,
presence of psychiatric disorders, presence of airway disorders and states of
stress. Such applications have the potential to be an important pillar for
online help lines, and may gradually be introduced into eHealth modules.
However, NLP is language-specific, and for languages such as Dutch, NLP models
are scarce. As a result, existing Dutch NLP models poorly capture long-range
semantic dependencies across sentences. To overcome this, here we present
belabBERT, a new Dutch language model extending the RoBERTa architecture.
belabBERT is trained on a large Dutch corpus (over 32 GB) of web-crawled text. We
applied belabBERT to the classification of psychiatric illnesses. First, we
evaluated the strength of text-based classification using belabBERT, and
compared the results to the existing RobBERT model. Then, we compared the
performance of belabBERT to audio classification for psychiatric disorders.
Finally, we briefly explored extending the framework to a hybrid
text- and audio-based classification. Our results show that belabBERT
outperformed the current best text classification network for Dutch, RobBERT.
belabBERT also outperformed classification based on audio alone.
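To make the classification pipeline concrete, below is a minimal sketch of fine-tuning a Dutch RoBERTa-style encoder for psychiatric text classification with the Hugging Face transformers library. The checkpoint name, label set, and example sentences are placeholders, since the abstract does not name a published checkpoint; RobBERT's public checkpoint (pdelobelle/robbert-v2-dutch-base) could be substituted. The small late_fusion helper at the end illustrates one simple way a hybrid text- and audio-based classifier could be combined; it is an assumption, not the paper's actual fusion method.

```python
# Minimal sketch: fine-tune a Dutch RoBERTa-style encoder for text-based
# psychiatric classification. Model id, labels, and data are placeholders.
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "jwouts/belabBERT"        # hypothetical checkpoint name (assumption)
LABELS = ["control", "psychiatric"]  # illustrative binary label set

# Toy interview-style sentences standing in for the clinical transcripts.
train = Dataset.from_dict({
    "text": ["Ik voel me de laatste tijd erg somber.",
             "Het gaat goed met me, ik slaap prima."],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=len(LABELS))

def tokenize(batch):
    # Truncate long transcripts to the encoder's 512-token maximum.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="belabbert-psych",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()

def late_fusion(text_probs, audio_probs, w=0.5):
    """Weighted average of class probabilities from a text and an audio model:
    one simple way to realize the hybrid scheme mentioned in the abstract."""
    return w * np.asarray(text_probs) + (1 - w) * np.asarray(audio_probs)
```

Under these assumptions, the audio branch would be any classifier that outputs class probabilities for the same label set; late_fusion then averages the two probability vectors before taking the argmax.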
Related papers
- HalleluBERT: Let every token that has meaning bear its weight [0.0]
We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary.
arXiv Detail & Related papers (2025-10-24T11:52:29Z) - Beyond Architectures: Evaluating the Role of Contextual Embeddings in Detecting Bipolar Disorder on Social Media [0.18416014644193066]
Bipolar disorder is a chronic mental illness that is frequently underdiagnosed due to subtle early symptoms and social stigma. This paper explores advanced natural language processing (NLP) models for recognizing signs of bipolar disorder in user-generated social media text.
arXiv Detail & Related papers (2025-07-17T05:14:19Z) - GeistBERT: Breathing Life into German NLP [0.22099217573031676]
GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus. The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens. It achieved strong results across all tasks, leading among base models and setting a new state-of-the-art (SOTA) in GermEval 2018 fine-grained text classification.
arXiv Detail & Related papers (2025-06-13T15:53:17Z) - SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic
Organization in HuBERT [49.06057768982775]
We show that a syllabic organization emerges in learning sentence-level representation of speech.
We propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech.
arXiv Detail & Related papers (2023-10-16T20:05:36Z) - From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early
Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries).
We present the FreEM_max corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEM_max.
arXiv Detail & Related papers (2022-02-18T22:17:22Z) - Towards Efficient NLP: A Standard Evaluation and A Strong Baseline [55.29756535335831]
This work presents ELUE (Efficient Language Understanding Evaluation), a standard evaluation, and a public leaderboard for efficient NLP models.
Along with the benchmark, we also pre-train and release a strong baseline, ElasticBERT, whose elasticity is both static and dynamic.
arXiv Detail & Related papers (2021-10-13T21:17:15Z) - FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z) - Evaluation of BERT and ALBERT Sentence Embedding Performance on
Downstream NLP Tasks [4.955649816620742]
This paper explores sentence embedding models for BERT and ALBERT.
We take a modified BERT network with siamese and triplet network structures, called Sentence-BERT (SBERT), and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT).
arXiv Detail & Related papers (2021-01-26T09:14:06Z) - GottBERT: a pure German Language Model [0.0]
No single-language German RoBERTa model had been published; this work introduces one (GottBERT).
In an evaluation, we compare its performance on the two Named Entity Recognition (NER) tasks CoNLL 2003 and GermEval 2014, as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD, against existing single-language German BERT models and two multilingual ones.
GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture.
arXiv Detail & Related papers (2020-12-03T17:45:03Z) - An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research.
Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains.
In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
arXiv Detail & Related papers (2020-11-12T17:14:32Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR)
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z) - Text-based classification of interviews for mental health -- juxtaposing
the state of the art [0.0]
Currently, the state of the art for classification of psychiatric illness is audio-based.
This thesis aims to design and evaluate a state-of-the-art text classification network on this challenge.
arXiv Detail & Related papers (2020-07-29T16:19:30Z) - TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z) - AraBERT: Transformer-based Model for Arabic Language Understanding [0.0]
We pre-trained BERT specifically for the Arabic language in the pursuit of achieving the same success that BERT did for the English language.
The results showed that the newly developed AraBERT achieved state-of-the-art performance on most tested Arabic NLP tasks.
arXiv Detail & Related papers (2020-02-28T22:59:24Z)