MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish
- URL: http://arxiv.org/abs/2404.07814v1
- Date: Thu, 11 Apr 2024 14:57:19 GMT
- Title: MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish
- Authors: Stefan Bott, Horacio Saggion, Nelson Peréz Rojas, Martin Solis Salazar, Saul Calderon Ramirez
- Abstract summary: This paper presents MultiLS-SP/CA, a novel dataset for lexical simplification in Spanish and Catalan.
This dataset represents the first of its kind in Catalan and a substantial addition to the sparse data available for automatic lexical simplification in Spanish.
- Score: 3.8704030295841534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents MultiLS-SP/CA, a novel dataset for lexical simplification in Spanish and Catalan. This dataset represents the first of its kind in Catalan and a substantial addition to the sparse data on automatic lexical simplification that is available for Spanish. Specifically, MultiLS-SP is the first dataset for Spanish that includes scalar ratings of the understanding difficulty of lexical items. In addition, we describe experiments with this dataset, which can serve as a baseline for future work on the same data.
 
      
Related papers
        - H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables [56.73919743039263]
 This paper introduces a novel algorithm that integrates both symbolic and semantic (textual) approaches in a two-stage process to address limitations.
Our experiments demonstrate that H-STAR significantly outperforms state-of-the-art methods across three question-answering (QA) and fact-verification datasets.
 arXiv  Detail & Related papers  (2024-06-29T21:24:19Z)
- A Novel Dataset for Financial Education Text Simplification in Spanish [4.475176409401273]
 In Spanish, there are few datasets that can be used to create text simplification systems.
We created a dataset with 5,314 complex and simplified sentence pairs using established simplification rules.
 arXiv  Detail & Related papers  (2023-12-15T15:47:08Z)
- Multilingual Controllable Transformer-Based Lexical Simplification [4.718531520078843]
 This paper proposes mTLS, a controllable Transformer-based Lexical Simplification (LS) system fine-tuned with the T5 model.
The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words.
 arXiv  Detail & Related papers  (2023-07-05T08:48:19Z)
- A New Dataset and Empirical Study for Sentence Simplification in Chinese [50.0624778757462]
 This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese.
We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications.
In the end, we explore whether Large Language Models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
 arXiv  Detail & Related papers  (2023-06-07T06:47:34Z)
- Multilingual Simplification of Medical Texts [49.469685530201716]
 We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages.
We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses.
Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.
 arXiv  Detail & Related papers  (2023-05-21T18:25:07Z)
- Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
 Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
 arXiv  Detail & Related papers  (2022-12-14T13:41:49Z)
- LSA-T: The first continuous Argentinian Sign Language dataset for Sign Language Translation [52.87578398308052]
 Sign language translation (SLT) is an active field of study that encompasses human-computer interaction, computer vision, natural language processing and machine learning.
This paper presents the first continuous Argentinian Sign Language (LSA) dataset.
It contains 14,880 sentence-level videos of LSA extracted from the CN Sordos YouTube channel, with labels and keypoint annotations for each signer.
 arXiv  Detail & Related papers  (2022-11-14T14:46:44Z)
- ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification [17.101023503289856]
 ALEXSIS-PT is a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words.
We evaluate four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau.
 arXiv  Detail & Related papers  (2022-09-19T14:10:21Z)
- Lexical Simplification Benchmarks for English, Portuguese, and Spanish [23.90236014260585]
 We present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese.
This is the first dataset that offers a direct comparison of lexical simplification systems for three languages.
We find a state-of-the-art neural lexical simplification system outperforms a state-of-the-art non-neural lexical simplification system in all three languages.
 arXiv  Detail & Related papers  (2022-09-12T15:06:26Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
 We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
 arXiv  Detail & Related papers  (2022-05-23T16:47:37Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
 We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
 arXiv  Detail & Related papers  (2022-04-30T13:23:16Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
 In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets of other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
 arXiv  Detail & Related papers  (2022-04-28T14:02:16Z)
- Automatic Lexical Simplification for Turkish [0.0]
 We present the first automatic lexical simplification system for the Turkish language.
Recent text simplification efforts rely on manually crafted simplified corpora and comprehensive NLP tools.
We present a new text simplification pipeline based on the pretrained representation model BERT together with morphological features to generate grammatically correct and semantically appropriate word-level simplifications.
 arXiv  Detail & Related papers  (2022-01-15T15:58:44Z)
- Predicting Lexical Complexity in English Texts [6.556254680121433]
 The first step in most text simplification is to predict which words are considered complex for a given target population.
This task is commonly referred to as Complex Word Identification (CWI) and it is often modelled as a supervised classification problem.
For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required.
 arXiv  Detail & Related papers  (2021-02-17T14:05:30Z)
- Chinese Lexical Simplification [29.464388721085548]
 There is no prior research on the Chinese lexical simplification (CLS) task.
To circumvent difficulties in acquiring annotations, we manually create the first benchmark dataset for CLS.
We present five different types of methods as baselines to generate substitute candidates for the complex word.
 arXiv  Detail & Related papers  (2020-10-14T12:55:36Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
 Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
 arXiv  Detail & Related papers  (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     