RuMedBench: A Russian Medical Language Understanding Benchmark
- URL: http://arxiv.org/abs/2201.06499v1
- Date: Mon, 17 Jan 2022 16:23:33 GMT
- Title: RuMedBench: A Russian Medical Language Understanding Benchmark
- Authors: Pavel Blinov, Arina Reshetnikova, Aleksandr Nesterov, Galina Zubkova,
Vladimir Kokh
- Abstract summary: The paper describes the open Russian medical language understanding benchmark covering several task types.
We prepare the unified format labeling, data split, and evaluation metrics for new tasks.
A single-number metric expresses a model's ability to cope with the benchmark.
- Score: 58.99199480170909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The paper describes the open Russian medical language understanding benchmark
covering several task types (classification, question answering, natural
language inference, named entity recognition) on a number of novel text sets.
Given the sensitive nature of the data in healthcare, such a benchmark
partially closes the problem of Russian medical dataset absence. We prepare the
unified format labeling, data split, and evaluation metrics for new tasks. The
remaining tasks are from existing datasets with a few modifications. A
single-number metric expresses a model's ability to cope with the benchmark.
Moreover, we implement several baseline models, from simple ones to neural
networks with transformer architecture, and release the code. Expectedly, the
more advanced models yield better performance, but even a simple model is
enough for a decent result in some tasks. Furthermore, for all tasks, we
provide a human evaluation. Interestingly, the models outperform humans in the
large-scale classification tasks. However, the advantage of natural
intelligence remains in the tasks requiring more knowledge and reasoning.
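The abstract mentions a single-number metric but does not say how it is computed. As a minimal sketch, assuming the overall score is an unweighted mean of per-task scores (an assumption, not confirmed by the abstract), the aggregation could look like the following. The function name `overall_score`, the task keys, and the score values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def overall_score(task_scores: dict[str, float]) -> float:
    """Return the unweighted mean of per-task scores.

    The unweighted mean is an assumed aggregation rule; the abstract only
    states that a single number summarizes performance on the benchmark.
    """
    if not task_scores:
        raise ValueError("at least one task score is required")
    return mean(task_scores.values())


# Hypothetical task names and placeholder scores (percent), for illustration only.
example_scores = {
    "classification": 80.0,
    "question_answering": 60.0,
    "nli": 70.0,
    "ner": 90.0,
}

print(overall_score(example_scores))
```

An unweighted mean treats every task equally regardless of dataset size; a benchmark could equally use a weighted scheme, so this choice should be checked against the paper itself.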
Related papers
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A, performance on redacted queries drops by more than 25% compared to the original.
arXiv Detail & Related papers (2024-11-08T21:22:37Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - FERMAT: An Alternative to Accuracy for Numerical Reasoning [11.893004722079557]
Numerical reasoning is measured using a single score on existing datasets.
We introduce a multi-view evaluation set for numerical reasoning in English, called FERMAT.
FERMAT evaluates models on various key numerical reasoning aspects such as number understanding, mathematical operations, and training dependency.
arXiv Detail & Related papers (2023-05-27T15:00:45Z)
- Multitask Prompted Training Enables Zero-Shot Task Generalization [70.12770442071657]
We develop a system for mapping general natural language tasks into a human-readable prompted form.
We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks.
The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.
arXiv Detail & Related papers (2021-10-15T17:08:57Z)
- Cross-lingual Approach to Abstractive Summarization [0.0]
Cross-lingual model transfers are successfully applied in low-resource languages.
We used a pretrained English summarization model based on deep neural networks and sequence-to-sequence architecture.
We developed several models with different proportions of target language data for fine-tuning.
arXiv Detail & Related papers (2020-12-08T09:30:38Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)