An analysis of full-size Russian complexly NER labelled corpus of
Internet user reviews on the drugs based on deep learning and language neural
nets
- URL: http://arxiv.org/abs/2105.00059v1
- Date: Fri, 30 Apr 2021 19:46:24 GMT
- Title: An analysis of full-size Russian complexly NER labelled corpus of
Internet user reviews on the drugs based on deep learning and language neural
nets
- Authors: Alexander Sboev, Sanna Sboeva, Ivan Moloshnikov, Artem Gryaznov, Roman
Rybka, Alexander Naumov, Anton Selivanov, Gleb Rylkov, Viacheslav Ilyin
- Abstract summary: We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
- Score: 94.37521840642141
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present the full-size Russian complexly NER-labeled corpus of Internet
user reviews, along with an evaluation of accuracy levels reached on this
corpus by a set of advanced deep learning neural networks to extract the
pharmacologically meaningful entities from Russian texts. The corpus annotation
includes mentions of the following entities: Medication (33005 mentions),
Adverse Drug Reaction (1778), Disease (17403), and Note (4490). Two of them -
Medication and Disease - comprise a set of attributes. A part of the corpus has
the coreference annotation with 1560 coreference chains in 300 documents.
A special multi-label model, based on a language model and a set of features and
suited to the labeling of the presented corpus, is developed. The influence of
different model modifications is analyzed: word vector representations,
types of language models pre-trained for Russian, text normalization styles,
and other preprocessing steps. The size of our
corpus is sufficient to study the effects of corpus labeling particularities and
of balancing entities in the corpus. As a result, the state of the art for the
pharmacological entity extraction problem for Russian is established on a
full-size labeled corpus. For adverse drug reaction (ADR)
recognition, it is 61.1 by the F1-exact metric, which, as our analysis shows, is
on par with the accuracy level for corpora in other languages with similar
characteristics and ADR representativeness. The evaluated baseline precision
of coreference relation extraction on the corpus is 71, which is higher than the
results reached on other Russian corpora.
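The abstract reports its ADR result by an "F1-exact" metric without defining it here. As a rough illustration only, the sketch below assumes the common exact-match reading: a predicted entity counts as correct only when its start offset, end offset, and type all match a gold annotation, and scores are micro-averaged over documents. The `Span` type, the `f1_exact` function, and the toy offsets are hypothetical and not taken from the paper, whose evaluation may differ (for example in averaging or in how discontinuous mentions are handled).

```python
from typing import List, Set, Tuple

# Hypothetical span representation: (start_offset, end_offset, entity_type).
Span = Tuple[int, int, str]


def f1_exact(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1 over documents, counting only exact span matches."""
    # A true positive requires identical boundaries and identical type.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): the second span in document 0 has a wrong end offset,
# so it is a miss under exact matching even though it overlaps the gold span.
gold = [{(0, 9, "Medication"), (25, 33, "ADR")}, {(4, 12, "Disease")}]
pred = [{(0, 9, "Medication"), (25, 30, "ADR")}, {(4, 12, "Disease")}]
score = f1_exact(gold, pred)
print(round(100 * score, 1))
```

Under this reading, a partially overlapping prediction is penalized twice (as a false positive and a false negative), which is one reason exact-match scores tend to be lower than relaxed or partial-match scores.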
Related papers
- FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis [0.0]
The Algerian dialect (AD) faces challenges due to the absence of annotated corpora.
This study outlines the development process of a specialized corpus for Fake News (FN) detection and sentiment analysis (SA) in AD called FASSILA.
arXiv Detail & Related papers (2024-11-07T10:39:10Z)
- RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, termed Radiological Report (Text) Evaluation (RaTEScore).
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
- Biomedical Entity Linking for Dutch: Fine-tuning a Self-alignment BERT Model on an Automatically Generated Wikipedia Corpus [2.4686585810894477]
This paper presents the first evaluated biomedical entity linking model for the Dutch language.
We derive a corpus from Wikipedia of ontology-linked Dutch biomedical entities in context.
Our results indicate that biomedical entity linking in a language other than English remains challenging.
arXiv Detail & Related papers (2024-05-20T10:30:36Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining [117.56261821197741]
We present several BERT-based models for Russian language biomedical text mining.
The models are pre-trained on a corpus of freely available texts in the Russian biomedical domain.
arXiv Detail & Related papers (2022-04-08T09:18:59Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews [13.428173157465062]
The Russian Drug Reaction Corpus (RuDReC) is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products.
The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources.
The labelled part contains 500 consumer reviews about drug therapy with drug- and disease-related information.
arXiv Detail & Related papers (2020-04-07T19:26:13Z)
- NUBES: A Corpus of Negation and Uncertainty in Spanish Clinical Texts [5.424799109837065]
This paper introduces the first version of the NUBes corpus (Negation and Uncertainty annotations in Biomedical texts in Spanish).
The corpus is part of ongoing research and currently consists of 29,682 sentences obtained from anonymised health records annotated with negation and uncertainty.
arXiv Detail & Related papers (2020-04-02T15:51:31Z)