Multilingual Simplification of Medical Texts
- URL: http://arxiv.org/abs/2305.12532v4
- Date: Wed, 18 Oct 2023 04:20:05 GMT
- Title: Multilingual Simplification of Medical Texts
- Authors: Sebastian Joseph, Kathryn Kazanas, Keziah Reina, Vishnesh J. Ramanathan, Wei Xu, Byron C. Wallace, and Junyi Jessy Li
- Abstract summary: We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages.
We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses.
Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.
- Score: 49.469685530201716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated text simplification aims to produce simple versions of complex
texts. This task is especially useful in the medical domain, where the latest
medical findings are typically communicated via complex and technical articles.
This creates barriers for laypeople seeking access to up-to-date medical
findings, consequently impeding progress on health literacy. Most existing work
on medical text simplification has focused on monolingual settings, with the
result that such evidence would be available in only one language (most
often, English). This work addresses this limitation via multilingual
simplification, i.e., directly simplifying complex texts into simplified texts
in multiple languages. We introduce MultiCochrane, the first sentence-aligned
multilingual text simplification dataset for the medical domain in four
languages: English, Spanish, French, and Farsi. We evaluate fine-tuned and
zero-shot models across these languages, with extensive human assessments and
analyses. Although models can now generate viable simplified texts, we identify
outstanding challenges that this dataset might be used to address.
Related papers
- Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain [19.58987478434808]
We present Medical mT5, the first open-source text-to-text multilingual model for the medical domain.
A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks.
arXiv Detail & Related papers (2024-04-11T10:01:32Z)
- A Novel Dataset for Financial Education Text Simplification in Spanish [4.475176409401273]
In Spanish, there are few datasets that can be used to create text simplification systems.
We created a dataset with 5,314 complex and simplified sentence pairs using established simplification rules.
arXiv Detail & Related papers (2023-12-15T15:47:08Z)
- A New Dataset and Empirical Study for Sentence Simplification in Chinese [50.0624778757462]
This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese.
We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications.
In the end, we explore whether Large Language Models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
arXiv Detail & Related papers (2023-06-07T06:47:34Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Cross-lingual Argument Mining in the Medical Domain [6.0158981171030685]
We show how to perform Argument Mining (AM) in medical texts for which no annotated data is available.
Our work shows that automatically translating and projecting annotations (data-transfer) from English to a given target language is an effective way to generate annotated data.
We also show how the automatically generated data in Spanish can be used to improve results in the original English monolingual setting.
arXiv Detail & Related papers (2023-01-25T11:21:12Z)
- Lexical Simplification Benchmarks for English, Portuguese, and Spanish [23.90236014260585]
We present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese.
This is the first dataset that offers a direct comparison of lexical simplification systems for three languages.
We find a state-of-the-art neural lexical simplification system outperforms a state-of-the-art non-neural lexical simplification system in all three languages.
arXiv Detail & Related papers (2022-09-12T15:06:26Z)
- Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method, based on a language model trained on medical forum data, generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- Paragraph-level Simplification of Medical Texts [35.650619024498425]
Manual simplification does not scale to the rapidly growing body of biomedical literature.
We introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics.
We propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts.
arXiv Detail & Related papers (2021-04-12T18:56:05Z)
- Enabling Language Models to Fill in the Blanks [81.59381915581892]
We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document.
We train (or fine-tune) off-the-shelf language models on sequences containing the concatenation of artificially-masked text and the text which was masked.
We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics.
arXiv Detail & Related papers (2020-05-11T18:00:03Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)