Predicting Lexical Complexity in English Texts
- URL: http://arxiv.org/abs/2102.08773v1
- Date: Wed, 17 Feb 2021 14:05:30 GMT
- Title: Predicting Lexical Complexity in English Texts
- Authors: Matthew Shardlow, Richard Evans and Marcos Zampieri
- Abstract summary: The first step in most text simplification pipelines is to predict which words are considered complex for a given target population.
This task is commonly referred to as Complex Word Identification (CWI) and is often modelled as a supervised classification problem.
Training such systems requires annotated datasets in which words, and sometimes multi-word expressions, are labelled according to their complexity.
- Score: 6.556254680121433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The first step in most text simplification pipelines is to predict
which words are considered complex for a given target population before
carrying out lexical substitution. This task is commonly referred to as Complex
Word Identification (CWI) and is often modelled as a supervised classification
problem. Training such systems requires annotated datasets in which words, and
sometimes multi-word expressions, are labelled according to their complexity.
In this paper we analyze previous work on this task and investigate the
properties of complex word identification datasets for English.
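Since CWI is framed above as supervised classification, the setup can be sketched with a toy classifier. The words, frequencies, and features below are invented for illustration; real systems train on annotated CWI corpora and use much richer feature sets or neural encoders.

```python
# Minimal sketch of CWI as binary classification (toy data for illustration).
from sklearn.linear_model import LogisticRegression

# Toy training set: (word, log corpus frequency, label); 1 = complex, 0 = simple.
train = [
    ("dog", 6.5, 0), ("house", 6.1, 0), ("run", 6.8, 0), ("cat", 6.7, 0),
    ("ubiquitous", 2.1, 1), ("ephemeral", 1.8, 1),
    ("sesquipedalian", 0.5, 1), ("obfuscate", 1.9, 1),
]

def features(word, log_freq):
    # Two classic surface features: word length and corpus frequency.
    return [len(word), log_freq]

X = [features(w, f) for w, f, _ in train]
y = [label for _, _, label in train]

clf = LogisticRegression().fit(X, y)
print(clf.predict([features("perspicacious", 1.2)])[0])  # 1 (complex)
```

Length and frequency are only a baseline; shared-task systems typically add psycholinguistic features and contextual embeddings on top of such surface cues.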
Related papers
- H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables [56.73919743039263]
This paper introduces a novel algorithm that integrates symbolic and semantic (textual) approaches in a two-stage process to address the limitations of each.
Our experiments demonstrate that H-STAR significantly outperforms state-of-the-art methods across three question-answering (QA) and fact-verification datasets.
arXiv Detail & Related papers (2024-06-29T21:24:19Z)
- Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language [0.0]
This paper endeavors to establish a straightforward yet potent benchmark for semantic search in Arabic.
To precisely evaluate the effectiveness of these metrics and the dataset, we conduct our assessment of semantic search within the framework of retrieval augmented generation (RAG)
arXiv Detail & Related papers (2024-03-27T08:42:31Z)
- Conjunct Resolution in the Face of Verbal Omissions [51.220650412095665]
We propose a conjunct resolution task that operates directly on the text and makes use of a split-and-rephrase paradigm in order to recover the missing elements in the coordination structure.
We curate a large dataset, containing over 10K examples of naturally-occurring verbal omissions with crowd-sourced annotations.
We train various neural baselines for this task, and show that while our best method obtains decent performance, it leaves ample space for improvement.
arXiv Detail & Related papers (2023-05-26T08:44:02Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
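The decompounding task itself can be illustrated with a toy greedy dictionary splitter. This is purely illustrative: the vocabulary is made up, and the paper trains dedicated models rather than relying on dictionary lookup.

```python
# Toy dictionary-based decompounder (illustrative only).
VOCAB = {"book", "shelf", "sun", "flower", "butter"}

def decompound(word, vocab=VOCAB):
    # Try to split the word into two in-vocabulary parts;
    # fall back to returning the word unsplit.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in vocab and right in vocab:
            return [left, right]
    return [word]

print(decompound("bookshelf"))   # ['book', 'shelf']
print(decompound("butterfly"))   # ['butterfly'] ('fly' not in the toy vocab)
```

The second example shows why lookup alone is brittle: "butterfly" is not a true compound of "butter" + "fly", and handling such non-compositional cases is part of what motivates learned models.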
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Lexical Complexity Prediction: An Overview [13.224233182417636]
The occurrence of unknown words in texts significantly hinders reading comprehension.
Computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives.
We present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on English data.
arXiv Detail & Related papers (2023-03-08T19:35:08Z)
- Improving Multi-task Generalization Ability for Neural Text Matching via Prompt Learning [54.66399120084227]
Recent state-of-the-art neural text matching models based on pre-trained language models (PLMs) struggle to generalize to different tasks.
We adopt a specialization-generalization training strategy and refer to it as Match-Prompt.
In the specialization stage, descriptions of different matching tasks are mapped to only a few prompt tokens.
In the generalization stage, the text matching model learns essential matching signals by being trained on diverse matching tasks.
arXiv Detail & Related papers (2022-04-06T11:01:08Z)
- Relation Clustering in Narrative Knowledge Graphs [71.98234178455398]
Relational sentences in the original text are embedded (with SBERT) and clustered in order to merge semantically similar relations.
Preliminary tests show that such clustering might successfully detect similar relations, and provide a valuable preprocessing for semi-supervised approaches.
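The embed-and-cluster idea can be sketched as follows. The hard-coded 2-d vectors are stand-ins for real SBERT embeddings (which would come from the sentence-transformers library); the sentences and values are invented for illustration.

```python
# Cluster relation phrases by embedding similarity (stand-in vectors
# replace real SBERT embeddings to keep the example self-contained).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

sentences = [
    "X is married to Y", "X wed Y",          # synonymous pair 1
    "X works for Y", "X is employed by Y",   # synonymous pair 2
]
emb = np.array([[0.90, 0.10], [0.85, 0.15],
                [0.10, 0.90], [0.20, 0.80]])

# Merge semantically similar relation phrases into the same cluster.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(emb)
print(labels)  # synonymous relations share a cluster label
```

In practice the number of clusters is not known in advance, so a distance threshold (or a method like HDBSCAN) is a common substitute for a fixed `n_clusters`.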
arXiv Detail & Related papers (2020-11-27T10:43:04Z)
- Chinese Lexical Simplification [29.464388721085548]
There has been no prior research on the Chinese lexical simplification (CLS) task.
To circumvent difficulties in acquiring annotations, we manually create the first benchmark dataset for CLS.
We present five different types of methods as baselines to generate substitute candidates for the complex word.
arXiv Detail & Related papers (2020-10-14T12:55:36Z)
- Detecting Multiword Expression Type Helps Lexical Complexity Assessment [11.347177310504737]
Multiword expressions (MWEs) represent lexemes that should be treated as single lexical units due to their idiosyncratic nature.
Multiple NLP applications have been shown to benefit from MWE identification; however, the lexical complexity of MWEs remains an underexplored area.
arXiv Detail & Related papers (2020-05-12T11:25:07Z)
- ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z)
- CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data [13.224233182417636]
This paper presents the first English dataset for continuous lexical complexity prediction.
We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts.
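A common way to turn 5-point Likert annotations into a continuous [0, 1] complexity score is to rescale each rating and average across annotators. The sketch below shows that generic recipe; the exact aggregation used in CompLex may differ.

```python
# Map 5-point Likert ratings to a continuous complexity score in [0, 1].
def complexity_score(ratings):
    # Rescale each rating 1..5 onto {0, 0.25, 0.5, 0.75, 1}, then average.
    return sum((r - 1) / 4 for r in ratings) / len(ratings)

print(complexity_score([1, 2, 2]))  # 0.1666... : mostly "very easy"
```

Averaging rescaled ratings is what makes the prediction target continuous rather than a binary complex/simple label.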
arXiv Detail & Related papers (2020-03-16T03:54:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.