ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification
- URL: http://arxiv.org/abs/2209.09034v2
- Date: Fri, 9 Feb 2024 15:30:08 GMT
- Title: ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification
- Authors: Kai North, Marcos Zampieri, Tharindu Ranasinghe
- Abstract summary: ALEXSIS-PT is a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words.
We evaluate four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Lexical simplification (LS) is the task of automatically replacing complex words with simpler ones, making texts more accessible to various target populations (e.g., individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT was compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset containing Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.
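The abstract describes generating candidate substitutions with masked language models such as BERTimbau. The sketch below shows how that general approach typically looks, assuming the HuggingFace transformers library and the public BERTimbau base checkpoint (neuralmind/bert-base-portuguese-cased); the example sentence and complex word are illustrative and not drawn from ALEXSIS-PT, and the paper's actual generation and ranking setup may differ.

```python
# Minimal sketch of masked-LM substitute generation for lexical simplification.
# Assumes the HuggingFace `transformers` library; the sentence and complex word
# below are illustrative only, not taken from ALEXSIS-PT.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="neuralmind/bert-base-portuguese-cased",  # BERTimbau (base)
)

sentence = "O governo anunciou uma medida para mitigar os efeitos da crise."
complex_word = "mitigar"

# Replace the complex word with the model's mask token and ask the masked LM
# for its most probable fillers; these serve as candidate substitutions.
masked = sentence.replace(complex_word, fill_mask.tokenizer.mask_token, 1)
candidates = fill_mask(masked, top_k=10)

for c in candidates:
    print(f"{c['token_str']:>15}  {c['score']:.3f}")
```

Masking the complex word in context and keeping the top-k fillers yields a candidate list that can then be scored against the gold substitutions in the dataset.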
Related papers
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary of a source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
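As a rough illustration of the three template variants just described (original script, Latin script, or both), the sketch below builds the prompts; unidecode is used here only as a convenient stand-in for a proper romanizer such as uroman, and the prompt wording is hypothetical rather than taken from the paper.

```python
# Hypothetical sketch of the three prompt variants: target-language text in
# (1) its original script, (2) Latin script, (3) both. `unidecode` is only a
# stand-in romanizer; the paper's templates and transliteration tool may differ.
from unidecode import unidecode


def build_prompts(text: str, instruction: str) -> dict:
    latin = unidecode(text)  # crude romanization of the target-language text
    return {
        "original": f"{instruction}\n{text}",
        "latin": f"{instruction}\n{latin}",
        "both": f"{instruction}\n{text}\n{latin}",
    }


print(build_prompts("यह एक उदाहरण वाक्य है।",
                    "Classify the sentiment of the following sentence:"))
```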
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- MultiLS: A Multi-task Lexical Simplification Framework [21.81108113189197]
We present MultiLS, the first LS framework that allows for the creation of a multi-task LS dataset.
We also present MultiLS-PT, the first dataset to be created using the MultiLS framework.
arXiv Detail & Related papers (2024-02-22T21:16:18Z)
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval [56.65147231836708]
We develop SWIM-IR, a synthetic retrieval training dataset containing 33 languages for fine-tuning multilingual dense retrievers.
SAP (summarize-then-ask prompting) assists the large language model (LLM) in generating informative queries in the target language.
Our models, called SWIM-X, are competitive with human-supervised dense retrieval models.
arXiv Detail & Related papers (2023-11-10T00:17:10Z)
- Multilingual Controllable Transformer-Based Lexical Simplification [4.718531520078843]
This paper proposes mTLS, a controllable Transformer-based Lexical Simplification (LS) system fine-tuned with the T5 model.
The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words.
arXiv Detail & Related papers (2023-07-05T08:48:19Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.