LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for
Lexical Complexity Prediction
- URL: http://arxiv.org/abs/2105.08780v1
- Date: Tue, 18 May 2021 18:55:04 GMT
- Title: LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for
Lexical Complexity Prediction
- Authors: Abhinandan Desai and Kai North and Marcos Zampieri and Christopher M.
Homan
- Abstract summary: This paper describes team LCP-RIT's submission to the SemEval-2021 Task 1: Lexical Complexity Prediction (LCP).
Our system uses logistic regression and a wide range of linguistic features to predict the complexity of single words in this dataset.
We evaluate the results in terms of mean absolute error, mean squared error, Pearson correlation, and Spearman correlation.
- Score: 4.86331990243181
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper describes team LCP-RIT's submission to the SemEval-2021 Task 1:
Lexical Complexity Prediction (LCP). The task organizers provided participants
with an augmented version of CompLex (Shardlow et al., 2020), an English
multi-domain dataset in which words in context were annotated with respect to
their complexity using a five-point Likert scale. Our system uses logistic
regression and a wide range of linguistic features (e.g. psycholinguistic
features, n-grams, word frequency, POS tags) to predict the complexity of
single words in this dataset. We analyze the impact of different linguistic
features on classification performance, and we evaluate the results in terms
of mean absolute error, mean squared error, Pearson correlation, and Spearman
correlation.
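As a rough illustration of the four evaluation measures named in the abstract, the sketch below computes mean absolute error, mean squared error, Pearson correlation, and Spearman correlation in plain Python. It is not the authors' code: the function names and the sample gold/predicted complexity values are hypothetical and chosen only to show the arithmetic.

```python
from math import sqrt


def mae(gold, pred):
    """Mean absolute error between gold and predicted scores."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def mse(gold, pred):
    """Mean squared error between gold and predicted scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold and predicted complexity scores, for illustration only.
gold = [0.1, 0.2, 0.3, 0.4]
pred = [0.15, 0.25, 0.2, 0.5]
print(mae(gold, pred), mse(gold, pred), pearson(gold, pred), spearman(gold, pred))
```

The two error measures reward predictions that are numerically close to the gold scores, whereas the two correlations reward predictions that move with the gold scores (Pearson) or preserve their ordering (Spearman), so a system can do well on one group and poorly on the other.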
Related papers
- Lexical Complexity Prediction: An Overview [13.224233182417636]
The occurrence of unknown words in texts significantly hinders reading comprehension.
Computational modelling has been applied to identify complex words in texts and substitute them for simpler alternatives.
We present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on English data.
arXiv Detail & Related papers (2023-03-08T19:35:08Z)
- Compositional Generalization in Grounded Language Learning via Induced
Model Sparsity [81.38804205212425]
We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations.
We design an agent that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal.
Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations.
arXiv Detail & Related papers (2022-07-06T08:46:27Z)
- Multilingual Extraction and Categorization of Lexical Collocations with
Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Domain Adaptation in Multilingual and Multi-Domain Monolingual Settings
for Complex Word Identification [0.27998963147546146]
Complex word identification (CWI) is a cornerstone process towards proper text simplification.
CWI is highly dependent on context, whereas its difficulty is augmented by the scarcity of available datasets.
We propose a novel training technique for the CWI task based on domain adaptation to improve the target character and context representations.
arXiv Detail & Related papers (2022-05-15T13:21:02Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- UPB at SemEval-2021 Task 1: Combining Deep Learning and Hand-Crafted
Features for Lexical Complexity Prediction [0.7197592390105455]
We describe our approach for the SemEval-2021 Task 1: Lexical Complexity Prediction competition.
Our results are just 5.46% and 6.5% lower than the top scores obtained in the competition on the first and the second subtasks.
arXiv Detail & Related papers (2021-04-14T17:05:46Z)
- NEMO: Frequentist Inference Approach to Constrained Linguistic Typology
Feature Prediction in SIGTYP 2020 Shared Task [83.43738174234053]
We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features.
Our best configuration achieved the micro-averaged accuracy score of 0.66 on 149 test languages.
arXiv Detail & Related papers (2020-10-12T19:25:43Z)
- Probing Linguistic Features of Sentence-Level Representations in Neural
Relation Extraction [80.38130122127882]
We introduce 14 probing tasks targeting linguistic properties relevant to neural relation extraction (RE).
We use them to study representations learned by more than 40 different encoder architecture and linguistic feature combinations trained on two datasets.
We find that the bias induced by the architecture and the inclusion of linguistic features are clearly expressed in the probing task performance.
arXiv Detail & Related papers (2020-04-17T09:17:40Z)
- CompLex: A New Corpus for Lexical Complexity Prediction from Likert
Scale Data [13.224233182417636]
This paper presents the first English dataset for continuous lexical complexity prediction.
We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts.
arXiv Detail & Related papers (2020-03-16T03:54:22Z)