Med-EASi: Finely Annotated Dataset and Models for Controllable
Simplification of Medical Texts
- URL: http://arxiv.org/abs/2302.09155v1
- Date: Fri, 17 Feb 2023 21:50:13 GMT
- Title: Med-EASi: Finely Annotated Dataset and Models for Controllable
Simplification of Medical Texts
- Authors: Chandrayee Basu, Rosni Vasu, Michihiro Yasunaga, Qian Yang
- Abstract summary: Automatic medical text simplification can assist providers with patient-friendly communication and make medical texts more accessible.
We present $\textbf{Med-EASi}$ ($\underline{\textbf{Med}}$ical dataset for $\underline{\textbf{E}}$laborative and $\underline{\textbf{A}}$bstractive $\underline{\textbf{Si}}$mplification).
Our results show that our fine-grained annotations improve learning compared to the unannotated baseline.
- Score: 32.57058284812338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic medical text simplification can assist providers with
patient-friendly communication and make medical texts more accessible, thereby
improving health literacy. But curating a quality corpus for this task requires
the supervision of medical experts. In this work, we present
$\textbf{Med-EASi}$ ($\underline{\textbf{Med}}$ical dataset for
$\underline{\textbf{E}}$laborative and $\underline{\textbf{A}}$bstractive
$\underline{\textbf{Si}}$mplification), a uniquely crowdsourced and finely
annotated dataset for supervised simplification of short medical texts. Its
$\textit{expert-layman-AI collaborative}$ annotations facilitate
$\textit{controllability}$ over text simplification by marking four kinds of
textual transformations: elaboration, replacement, deletion, and insertion. To
learn medical text simplification, we fine-tune T5-large with four different
styles of input-output combinations, leading to two control-free and two
controllable versions of the model. We add two types of
$\textit{controllability}$ to text simplification by using a multi-angle
training approach: $\textit{position-aware}$, which uses in-place annotated
inputs and outputs, and $\textit{position-agnostic}$, where the model only
knows the contents to be edited, but not their positions. Our results show that
our fine-grained annotations improve learning compared to the unannotated
baseline. Furthermore, $\textit{position-aware}$ control generates better
simplification than the $\textit{position-agnostic}$ one. The data and code are
available at https://github.com/Chandrayee/CTRL-SIMP.
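
To make the training setup above concrete, here is a minimal sketch of a single position-aware fine-tuning step, assuming a standard HuggingFace seq2seq pipeline. The `simplify:` prefix, the in-line edit tags (`<R>`, `<D>`, `<E>`, `<I>`), and the learning rate are illustrative assumptions, not the dataset's actual markup; the real annotation scheme is defined in the CTRL-SIMP repository linked above.

```python
# Sketch of one position-aware fine-tuning step for controllable medical
# text simplification with T5-large. NOTE: the prompt prefix and the
# in-place edit tags are hypothetical placeholders; Med-EASi's actual
# markup is defined in the CTRL-SIMP repository.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Position-aware control: edit tags sit in-line, so the model sees *where*
# each transformation (replacement <R>, deletion <D>, elaboration <E>,
# insertion <I>) applies.
source = ("simplify: The patient presented with <R>acute myocardial "
          "infarction</R> and <D>concomitant</D> dyspnea.")
target = "The patient came in with a heart attack and trouble breathing."

# A position-agnostic variant would instead list the spans to edit after
# the text, without marking their positions, e.g.:
#   "simplify: <text> || replace: acute myocardial infarction || delete: concomitant"

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids

loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Only the input-output formatting differs across the four model variants described above; the underlying T5-large architecture and training objective stay the same.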
Related papers
- MedUnA: Language guided Unsupervised Adaptation of Vision-Language Models for Medical Image Classification [14.725941791069852]
We propose $\underline{\text{Med}}$ical $\underline{\text{Un}}$supervised $\underline{\text{A}}$daptation ($\texttt{MedUnA}$), a two-stage training approach: Adapter Pre-training and Unsupervised Learning.
We evaluate the performance of $\texttt{MedUnA}$ on three data modalities: chest X-rays, eye fundus images, and skin lesion images.
arXiv Detail & Related papers (2024-09-03T09:25:51Z) - Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [57.14250086701313]
We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
arXiv Detail & Related papers (2024-06-18T21:31:19Z) - Text2MDT: Extracting Medical Decision Trees from Medical Texts [33.58610255918941]
We propose a novel task, Text2MDT, to explore the automatic extraction of medical decision trees (MDTs) from medical texts.
We normalize the form of the MDT and create an annotated Text-to-MDT dataset in Chinese with the participation of medical experts.
arXiv Detail & Related papers (2024-01-04T02:33:38Z) - Text Embeddings Reveal (Almost) As Much As Text [86.5822042193058]
We investigate the problem of embedding $\textit{inversion}$, reconstructing the full text represented in dense text embeddings.
We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover $92\%$ of $32$-token text inputs exactly.
arXiv Detail & Related papers (2023-10-10T17:39:03Z) - TOPFORMER: Topology-Aware Authorship Attribution of Deepfake Texts with Diverse Writing Styles [14.205559299967423]
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended, high-quality texts that are non-trivial to distinguish from human-written texts.
Users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and dis/misinformation at scale.
To mitigate this problem, a computational method to determine if a given text is a deepfake text or not is desired.
We propose TopFormer to improve existing authorship attribution (AA) solutions by capturing more linguistic patterns in deepfake texts.
arXiv Detail & Related papers (2023-09-22T15:32:49Z) - Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z) - Multilingual Simplification of Medical Texts [49.469685530201716]
We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages.
We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses.
Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.
arXiv Detail & Related papers (2023-05-21T18:25:07Z) - AutoMeTS: The Autocomplete for Medical Text Simplification [9.18959130745234]
We introduce a new parallel medical dataset consisting of aligned sentences from English Wikipedia and Simple English Wikipedia.
We show how the additional context of the sentence to be simplified can be incorporated to achieve better results.
We also introduce an ensemble model that combines the four pretrained neural language models (PNLMs) and outperforms the best individual model by 2.1%.
arXiv Detail & Related papers (2020-10-20T19:20:29Z) - All you need is a second look: Towards Tighter Arbitrary shape text detection [80.85188469964346]
Long curved text instances tend to be fragmented because of the limited receptive field size of CNNs.
Simple representations using rectangle or quadrangle bounding boxes fall short when dealing with more challenging arbitrary-shaped texts.
$\textit{NASK}$ reconstructs text instances with a tighter representation using the predicted geometrical attributes.
arXiv Detail & Related papers (2020-04-26T17:03:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.