A Novel Dataset for Financial Education Text Simplification in Spanish
- URL: http://arxiv.org/abs/2312.09897v1
- Date: Fri, 15 Dec 2023 15:47:08 GMT
- Title: A Novel Dataset for Financial Education Text Simplification in Spanish
- Authors: Nelson Perez-Rojas, Saul Calderon-Ramirez, Martin Solis-Salazar, Mario Romero-Sandoval, Monica Arias-Monge, Horacio Saggion
- Abstract summary: In Spanish, there are few datasets that can be used to create text simplification systems.
We created a dataset with 5,314 complex and simplified sentence pairs using established simplification rules.
- Score: 4.475176409401273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text simplification, a crucial task in natural language
processing, aims to make texts more comprehensible, particularly for
specific groups such as visually impaired Spanish speakers; Spanish is a
less-represented language in this field. Few Spanish datasets are
available for building text simplification systems. The primary
objective of our research is to develop a Spanish financial text
simplification dataset. We created a dataset with 5,314 complex and
simplified sentence pairs using established simplification rules. We also
compared our dataset with the simplifications generated from GPT-3, Tuner, and
MT5, in order to evaluate the feasibility of data augmentation using these
systems. In this manuscript we present the characteristics of our dataset and
the findings of the comparisons with other systems. The dataset is
available on Hugging Face at saul1917/FEINA.
Related papers
- MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish [3.8704030295841534]
This paper presents MultiLS-SP/CA, a novel dataset for lexical simplification in Spanish and Catalan.
This dataset represents the first of its kind in Catalan and a substantial addition to the sparse data on automatic lexical simplification.
arXiv Detail & Related papers (2024-04-11T14:57:19Z)
- German Text Simplification: Finetuning Large Language Models with Semi-Synthetic Data [0.7059555559002345]
This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts.
We finetune Large Language Models with up to 13 billion parameters on this data and evaluate their performance.
arXiv Detail & Related papers (2024-02-16T13:28:44Z)
- A Benchmark for Text Expansion: Datasets, Metrics, and Baselines [87.47745669317894]
This work presents the new task of Text Expansion (TE), which aims to insert fine-grained modifiers into proper locations in plain text.
We leverage four complementary approaches to construct a dataset with 12 million automatically generated instances and 2K human-annotated references.
On top of a pre-trained text-infilling model, we build both pipelined and joint Locate&Infill models, which demonstrate superiority over the Text2Text baselines.
arXiv Detail & Related papers (2023-09-17T07:54:38Z)
- A New Dataset and Empirical Study for Sentence Simplification in Chinese [50.0624778757462]
This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese.
We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications.
In the end, we explore whether Large Language Models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
arXiv Detail & Related papers (2023-06-07T06:47:34Z)
- MCTS: A Multi-Reference Chinese Text Simplification Dataset [15.080614581458091]
Chinese text simplification has long received very little research attention.
We introduce MCTS, a multi-reference Chinese text simplification dataset.
We evaluate the performance of several unsupervised methods and advanced large language models.
arXiv Detail & Related papers (2023-06-05T11:46:36Z)
- Multilingual Simplification of Medical Texts [49.469685530201716]
We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages.
We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses.
Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.
arXiv Detail & Related papers (2023-05-21T18:25:07Z)
- Exploiting Summarization Data to Help Text Simplification [50.0624778757462]
We analyzed the similarity between text summarization and text simplification and exploited summarization data to aid simplification.
We named these pairs Sum4Simp (S4S) and conducted human evaluations to show that S4S is high-quality.
arXiv Detail & Related papers (2023-02-14T15:32:04Z)
- Lexical Simplification Benchmarks for English, Portuguese, and Spanish [23.90236014260585]
We present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese.
This is the first dataset that offers a direct comparison of lexical simplification systems for three languages.
We find that a state-of-the-art neural lexical simplification system outperforms a state-of-the-art non-neural system in all three languages.
arXiv Detail & Related papers (2022-09-12T15:06:26Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.