Related papers: MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish

Related papers

Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní [1.0248720782518987]
This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaran.<n>Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences.<n>The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaran and informal Spanish in Paraguayan texts.
arXiv Detail & Related papers (2025-12-03T00:56:27Z)
H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables [56.73919743039263]
This paper introduces a novel algorithm that integrates both symbolic and semantic (textual) approaches in a two-stage process to address limitations. Our experiments demonstrate that H-STAR significantly outperforms state-of-the-art methods across three question-answering (QA) and fact-verification datasets.
arXiv Detail & Related papers (2024-06-29T21:24:19Z)
A Novel Dataset for Financial Education Text Simplification in Spanish [4.475176409401273]
In Spanish, there are few datasets that can be used to create text simplification systems. We created a dataset with 5,314 complex and simplified sentence pairs using established simplification rules.
arXiv Detail & Related papers (2023-12-15T15:47:08Z)
Multilingual Controllable Transformer-Based Lexical Simplification [4.718531520078843]
This paper proposes mTLS, a controllable Transformer-based Lexical Simplification (LS) system fined-tuned with the T5 model. The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words.
arXiv Detail & Related papers (2023-07-05T08:48:19Z)
A New Dataset and Empirical Study for Sentence Simplification in Chinese [50.0624778757462]
This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese. We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications. In the end, we explore whether Large Language Models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
arXiv Detail & Related papers (2023-06-07T06:47:34Z)
Multilingual Simplification of Medical Texts [49.469685530201716]
We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages. We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses. Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.
arXiv Detail & Related papers (2023-05-21T18:25:07Z)
Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (MS) aims at generating a concise summary in a different target language. To collect large-scale CLS data, existing datasets typically involve translation in their creation. In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z)
LSA-T: The first continuous Argentinian Sign Language dataset for Sign Language Translation [52.87578398308052]
Sign language translation (SLT) is an active field of study that encompasses human-computer interaction, computer vision, natural language processing and machine learning. This paper presents the first continuous Argentinian Sign Language (LSA) dataset. It contains 14,880 sentence level videos of LSA extracted from the CN Sordos YouTube channel with labels and keypoints annotations for each signer.
arXiv Detail & Related papers (2022-11-14T14:46:44Z)
ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification [17.101023503289856]
ALEXSIS-PT is a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. We evaluate four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau.
arXiv Detail & Related papers (2022-09-19T14:10:21Z)
Lexical Simplification Benchmarks for English, Portuguese, and Spanish [23.90236014260585]
We present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese. This is the first dataset that offers a direct comparison of lexical simplification systems for three languages. We find a state-of-the-art neural lexical simplification system outperforms a state-of-the-art non-neural lexical simplification system in all three languages.
arXiv Detail & Related papers (2022-09-12T15:06:26Z)
Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context. Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs. We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models. We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on English dataset and then applied on summarization datasets of other languages. We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model. We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
Automatic Lexical Simplification for Turkish [0.0]
We present the first automatic lexical simplification system for the Turkish language. Recent text simplification efforts rely on manually crafted simplified corpora and comprehensive NLP tools. We present a new text simplification pipeline based on pretrained representation model BERT together with morphological features to generate grammatically correct and semantically appropriate word-level simplifications.
arXiv Detail & Related papers (2022-01-15T15:58:44Z)
Predicting Lexical Complexity in English Texts [6.556254680121433]
The first step in most text simplification is to predict which words are considered complex for a given target population. This task is commonly referred to as Complex Word Identification (CWI) and it is often modelled as a supervised classification problem. For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required.
arXiv Detail & Related papers (2021-02-17T14:05:30Z)
Chinese Lexical Simplification [29.464388721085548]
There is no research work for Chinese lexical simplification ( CLS) task. To circumvent difficulties in acquiring annotations, we manually create the first benchmark dataset for CLS. We present five different types of methods as baselines to generate substitute candidates for the complex word.
arXiv Detail & Related papers (2020-10-14T12:55:36Z)
Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages. Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs. Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.