Lexical Simplification Benchmarks for English, Portuguese, and Spanish
- URL: http://arxiv.org/abs/2209.05301v1
- Date: Mon, 12 Sep 2022 15:06:26 GMT
- Title: Lexical Simplification Benchmarks for English, Portuguese, and Spanish
- Authors: Sanja Stajner, Daniel Ferres, Matthew Shardlow, Kai North, Marcos Zampieri, Horacio Saggion
- Abstract summary: We present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese.
This is the first dataset that offers a direct comparison of lexical simplification systems for three languages.
We find a state-of-the-art neural lexical simplification system outperforms a state-of-the-art non-neural lexical simplification system in all three languages.
- Score: 23.90236014260585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Even in highly-developed countries, as many as 15-30% of the population can
only understand texts written using a basic vocabulary. Their understanding of
everyday texts is limited, which prevents them from taking an active role in
society and making informed decisions regarding healthcare, legal
representation, or democratic choice. Lexical simplification is a natural
language processing task that aims to make text understandable to everyone by
replacing complex vocabulary and expressions with simpler ones, while
preserving the original meaning. It has attracted considerable attention in the
last 20 years, and fully automatic lexical simplification systems have been
proposed for various languages. The main obstacle for the progress of the field
is the absence of high-quality datasets for building and evaluating lexical
simplification systems. We present a new benchmark dataset for lexical
simplification in English, Spanish, and (Brazilian) Portuguese, and provide
details about data selection and annotation procedures. This is the first
dataset that offers a direct comparison of lexical simplification systems for
three languages. To showcase the usability of the dataset, we adapt two
state-of-the-art lexical simplification systems with differing architectures
(neural vs. non-neural) to all three languages (English, Spanish, and
Brazilian Portuguese) and evaluate their performances on our new dataset. For a
fairer comparison, we use several evaluation measures which capture varied
aspects of the systems' efficacy, and discuss their strengths and weaknesses.
We find a state-of-the-art neural lexical simplification system outperforms a
state-of-the-art non-neural lexical simplification system in all three
languages. More importantly, we find that the state-of-the-art neural lexical
simplification systems perform significantly better for English than for
Spanish and Portuguese.
Related papers
- MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish [3.8704030295841534]
This paper presents MultiLS-SP/CA, a novel dataset for lexical simplification in Spanish and Catalan.
This dataset represents the first of its kind in Catalan and a substantial addition to the sparse data on automatic lexical simplification.
arXiv Detail & Related papers (2024-04-11T14:57:19Z)
- A Novel Dataset for Financial Education Text Simplification in Spanish [4.475176409401273]
In Spanish, there are few datasets that can be used to create text simplification systems.
We created a dataset with 5,314 complex and simplified sentence pairs using established simplification rules.
arXiv Detail & Related papers (2023-12-15T15:47:08Z)
- Gaze-Driven Sentence Simplification for Language Learners: Enhancing Comprehension and Readability [11.50011780498048]
This paper presents a novel gaze-driven sentence simplification system designed to enhance reading comprehension.
Our system incorporates machine learning models tailored to individual learners, combining eye gaze features and linguistic features to assess sentence comprehension.
arXiv Detail & Related papers (2023-09-30T12:18:31Z)
- ARTIST: ARTificial Intelligence for Simplified Text [5.095775294664102]
Text Simplification is a key Natural Language Processing task that aims to reduce the linguistic complexity of a text.
Recent advances in Generative Artificial Intelligence (AI) have enabled automatic text simplification both on the lexical and syntactical levels.
arXiv Detail & Related papers (2023-08-25T16:06:06Z)
- A New Dataset and Empirical Study for Sentence Simplification in Chinese [50.0624778757462]
This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese.
We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications.
In the end, we explore whether Large Language Models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
arXiv Detail & Related papers (2023-06-07T06:47:34Z)
- Multilingual Simplification of Medical Texts [49.469685530201716]
We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages.
We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses.
Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.
arXiv Detail & Related papers (2023-05-21T18:25:07Z)
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively.
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
- Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.