CompLex: A New Corpus for Lexical Complexity Prediction from Likert
Scale Data
- URL: http://arxiv.org/abs/2003.07008v3
- Date: Thu, 11 Jun 2020 16:42:55 GMT
- Title: CompLex: A New Corpus for Lexical Complexity Prediction from Likert
Scale Data
- Authors: Matthew Shardlow, Michael Cooper, Marcos Zampieri
- Abstract summary: This paper presents the first English dataset for continuous lexical complexity prediction.
We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts.
- Score: 13.224233182417636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting which words are considered hard to understand for a given target
population is a vital step in many NLP applications such as text
simplification. This task is commonly referred to as Complex Word
Identification (CWI). With a few exceptions, previous studies have approached
the task as a binary classification task in which systems predict a complexity
value (complex vs. non-complex) for a set of target words in a text. This
choice is motivated by the fact that all CWI datasets compiled so far have been
annotated using a binary annotation scheme. Our paper addresses this limitation
by presenting the first English dataset for continuous lexical complexity
prediction. We use a 5-point Likert scale scheme to annotate complex words in
texts from three sources/domains: the Bible, Europarl, and biomedical texts.
This resulted in a corpus of 9,476 sentences, each annotated by approximately 7
annotators.
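The 5-point Likert annotations can be turned into a continuous complexity label by mapping each rating onto [0, 1] and averaging over the roughly 7 annotators per sentence. The sketch below illustrates one plausible aggregation; the function name and the linear 1 -> 0.0 ... 5 -> 1.0 mapping are assumptions for illustration, not necessarily the paper's exact scheme.

```python
def complexity_score(likert_ratings):
    """Aggregate 5-point Likert ratings (1-5) into a continuous
    complexity score in [0, 1].

    Assumption: a linear mapping (1 -> 0.0, 2 -> 0.25, ..., 5 -> 1.0),
    averaged over annotators, is one plausible way to derive a
    continuous label from Likert-scale annotations.
    """
    return sum((r - 1) / 4 for r in likert_ratings) / len(likert_ratings)

# e.g. seven annotators rating the same target word
print(complexity_score([2, 3, 2, 1, 3, 2, 2]))
```

Averaging over multiple annotators is what yields a continuous target rather than the binary complex/non-complex labels of earlier CWI datasets.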
Related papers
- Syntactic Complexity Identification, Measurement, and Reduction Through
Controlled Syntactic Simplification [0.0]
We present a classical syntactic dependency-based approach to splitting and rephrasing compound and complex sentences into sets of simplified sentences.
The paper also introduces an algorithm to identify and measure a sentence's syntactic complexity.
This work was accepted and presented at the International Workshop on Learning with Knowledge Graphs (IWLKG) at the WSDM-2023 conference.
arXiv Detail & Related papers (2023-04-16T13:13:58Z) - Lexical Complexity Prediction: An Overview [13.224233182417636]
The occurrence of unknown words in texts significantly hinders reading comprehension.
Computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives.
We present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on English data.
arXiv Detail & Related papers (2023-03-08T19:35:08Z) - Semantic Parsing for Conversational Question Answering over Knowledge
Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to their execution results.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z) - Measuring Annotator Agreement Generally across Complex Structured,
Multi-object, and Free-text Annotation Tasks [79.24863171717972]
Inter-annotator agreement (IAA) is a key metric for quality assurance.
Measures exist for simple categorical and ordinal labeling tasks, but little work has considered more complex labeling tasks.
Krippendorff's alpha, best known for use with simpler labeling tasks, does have a distance-based formulation with broader applicability.
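The distance-based formulation of Krippendorff's alpha mentioned above compares observed to expected disagreement, alpha = 1 - D_o / D_e, where the disagreement terms are built from a distance function over rating pairs. A minimal sketch for fully observed data with the interval (squared-difference) distance follows; it is for illustration only and is no replacement for a vetted implementation.

```python
from itertools import combinations

def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference)
    distance, for fully observed data.

    `units` is a list of lists: one inner list of ratings per
    annotated item. Units with fewer than two ratings contribute
    no pairable values and are dropped.
    """
    units = [u for u in units if len(u) > 1]
    values = [v for u in units for v in u]
    n = len(values)
    # Observed disagreement: within-unit pairwise squared differences,
    # each unit weighted by 1 / (m_u - 1); the factor 2 converts the
    # i < j pairs from combinations() into ordered (i != j) pairs.
    d_o = sum(
        sum((a - b) ** 2 for a, b in combinations(u, 2)) * 2 / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences over all value pairs.
    d_e = sum((a - b) ** 2 for a, b in combinations(values, 2)) * 2 / (n * (n - 1))
    return 1 - d_o / d_e

print(krippendorff_alpha_interval([[1, 1], [2, 2], [3, 3]]))  # perfect agreement -> 1.0
```

Because the distance function is pluggable (nominal, ordinal, interval, ...), the same coefficient extends beyond simple categorical labeling, which is the broader applicability the paper points to.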
arXiv Detail & Related papers (2022-12-15T20:12:48Z) - Structured information extraction from complex scientific text with
fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z) - Text Summarization with Oracle Expectation [88.39032981994535]
Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document.
Most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy.
We propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels.
arXiv Detail & Related papers (2022-09-26T14:10:08Z) - One Size Does Not Fit All: The Case for Personalised Word Complexity
Models [4.035753155957698]
Complex Word Identification (CWI) aims to detect words within a text that a reader may find difficult to understand.
In this paper, we show that personal models are best when predicting word complexity for individual readers.
arXiv Detail & Related papers (2022-05-05T10:53:31Z) - SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z) - LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for
Lexical Complexity Prediction [4.86331990243181]
This paper describes team LCP-RIT's submission to the SemEval-2021 Task 1: Lexical Complexity Prediction (LCP)
Our system uses logistic regression and a wide range of linguistic features to predict the complexity of single words in this dataset.
We evaluate the results in terms of mean absolute error, mean squared error, Pearson correlation, and Spearman correlation.
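The four metrics named above (mean absolute error, mean squared error, Pearson correlation, and Spearman correlation) can be computed from the gold and predicted complexity scores directly. The stdlib-only sketch below shows one way to do so (the function names are illustrative, not from the system description); scipy and scikit-learn provide vetted equivalents.

```python
from statistics import mean

def evaluate_lcp(gold, pred):
    """MAE, MSE, Pearson r, and Spearman rho between gold and
    predicted complexity scores."""

    def pearson(x, y):
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    def ranks(x):
        # 1-based ranks, averaging over ties.
        order = sorted(range(len(x)), key=lambda i: x[i])
        r = [0.0] * len(x)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    return {
        "mae": mean(abs(g - p) for g, p in zip(gold, pred)),
        "mse": mean((g - p) ** 2 for g, p in zip(gold, pred)),
        "pearson": pearson(gold, pred),
        # Spearman's rho is Pearson's r computed on the ranks.
        "spearman": pearson(ranks(gold), ranks(pred)),
    }

scores = evaluate_lcp(gold=[0.1, 0.2, 0.3, 0.4], pred=[0.2, 0.2, 0.4, 0.5])
```

Reporting both error metrics and both correlations matters for LCP: errors reward predictions close to the gold score, while the correlations reward correctly ordering words by difficulty even when the absolute scale is off.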
arXiv Detail & Related papers (2021-05-18T18:55:04Z) - Predicting Lexical Complexity in English Texts [6.556254680121433]
The first step in most text simplification pipelines is to predict which words are considered complex by a given target population.
This task is commonly referred to as Complex Word Identification (CWI) and it is often modelled as a supervised classification problem.
For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required.
arXiv Detail & Related papers (2021-02-17T14:05:30Z) - Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and accepts no responsibility for any consequences of its use.