Lexical Complexity Prediction: An Overview
- URL: http://arxiv.org/abs/2303.04851v1
- Date: Wed, 8 Mar 2023 19:35:08 GMT
- Title: Lexical Complexity Prediction: An Overview
- Authors: Kai North, Marcos Zampieri, Matthew Shardlow
- Abstract summary: The occurrence of unknown words in texts significantly hinders reading comprehension.
Computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives.
We present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on English data.
- Score: 13.224233182417636
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The occurrence of unknown words in texts significantly hinders reading
comprehension. To improve accessibility for specific target populations,
computational modelling has been applied to identify complex words in texts and
substitute them for simpler alternatives. In this paper, we present an overview
of computational approaches to lexical complexity prediction focusing on the
work carried out on English data. We survey relevant approaches to this problem
which include traditional machine learning classifiers (e.g. SVMs, logistic
regression) and deep neural networks as well as a variety of features, such as
those inspired by literature in psycholinguistics as well as word frequency,
word length, and many others. Furthermore, we introduce readers to past
competitions and available datasets created on this topic. Finally, we include
brief sections on applications of lexical complexity prediction, such as
readability and text simplification, together with related studies on languages
other than English.
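As a rough illustration of the classifier-and-features setup the survey covers, below is a minimal sketch of complex word identification with logistic regression over word length and word frequency; the tiny training set and frequency table are invented for illustration and are not taken from the paper.

```python
# Minimal sketch: complex word identification (CWI) as binary classification
# with logistic regression over two of the surveyed features:
# word length and log corpus frequency. Data below is invented for illustration.
import math
from sklearn.linear_model import LogisticRegression

# Hypothetical corpus frequencies (higher = more common).
FREQ = {"the": 1_000_000, "house": 50_000, "ubiquitous": 300, "ameliorate": 150}

def features(word: str):
    # Feature 1: word length. Feature 2: log frequency (log 1 = 0 for unseen words).
    return [len(word), math.log(FREQ.get(word.lower(), 1))]

train_words = ["the", "house", "ubiquitous", "ameliorate"]
train_labels = [0, 0, 1, 1]  # 1 = complex, 0 = simple

clf = LogisticRegression().fit([features(w) for w in train_words], train_labels)
print(clf.predict([features("ubiquitous"), features("house")]))
```

Systems in the survey combine many more features (e.g. psycholinguistic norms, syllable counts, n-gram probabilities) or swap the classifier for a deep neural network, but the pipeline shape is the same.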
Related papers
- On the Proper Treatment of Tokenization in Psycholinguistics [53.960910019072436]
The paper argues that token-level language models should be marginalized into character-level language models before they are used in psycholinguistic studies.
We find various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
arXiv Detail & Related papers (2024-10-03T17:18:03Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Computational Sentence-level Metrics Predicting Human Sentence Comprehension [27.152245569974678]
This study introduces innovative methods for computing sentence-level metrics using multilingual large language models.
The developed metrics, sentence surprisal and sentence relevance, are then tested and compared to validate whether they can predict how humans comprehend sentences as a whole across languages.
arXiv Detail & Related papers (2024-03-23T12:19:49Z)
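For context on the sentence surprisal metric mentioned in the entry above, a minimal sketch of one standard formulation follows: the summed per-token surprisal (negative log-probability) of a sentence under a causal language model. GPT-2 stands in here only to keep the example small; the paper itself uses multilingual large language models and its exact formulation may differ.

```python
# Sketch: sentence surprisal as summed per-token surprisal (-log2 p) under a
# causal LM. GPT-2 is a stand-in; the paper uses multilingual LLMs.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_surprisal(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t+1, so shift by one position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_logps = log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    return -token_logps.sum().item() / math.log(2)  # surprisal in bits

print(sentence_surprisal("The cat sat on the mat."))
```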
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- One Size Does Not Fit All: The Case for Personalised Word Complexity Models [4.035753155957698]
Complex Word Identification (CWI) aims to detect words within a text that a reader may find difficult to understand.
In this paper, we show that personal models are best when predicting word complexity for individual readers.
arXiv Detail & Related papers (2022-05-05T10:53:31Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study provides an assessment of existing language models for identifying the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Predicting Lexical Complexity in English Texts [6.556254680121433]
The first step in most text simplification systems is to predict which words are considered complex for a given target population.
This task is commonly referred to as Complex Word Identification (CWI) and it is often modelled as a supervised classification problem.
For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required.
arXiv Detail & Related papers (2021-02-17T14:05:30Z)
- Text Mining for Processing Interview Data in Computational Social Science [0.6820436130599382]
We use commercially available text analysis technology to process interview text data from a computational social science study.
We find that topical clustering and terminological enrichment provide for convenient exploration and quantification of the responses.
We encourage studies in social science to use text analysis, especially for exploratory open-ended studies.
arXiv Detail & Related papers (2020-11-28T00:44:35Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and have demonstrated promising results on this canonical task.
Despite this success, their performance can be largely jeopardized in practice since they are unable to capture high-order interactions between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
- CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data [13.224233182417636]
This paper presents the first English dataset for continuous lexical complexity prediction.
We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts.
arXiv Detail & Related papers (2020-03-16T03:54:22Z)
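As a small illustration of the Likert-scale scheme used for CompLex, the sketch below maps 5-point judgements onto [0, 1] and averages over annotators to obtain a continuous complexity score; the mapping constants and the annotator ratings are assumptions for illustration, not values from the corpus.

```python
# Sketch: turning 5-point Likert annotations into a continuous complexity score.
# Assumed mapping (very easy .. very difficult -> 0 .. 1), averaged over annotators.
LIKERT_TO_SCORE = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}

def complexity_score(ratings):
    return sum(LIKERT_TO_SCORE[r] for r in ratings) / len(ratings)

# Invented annotator ratings for one target word in one sentence:
print(complexity_score([4, 5, 3, 4]))  # -> 0.75
```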
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)