Bridging Natural Language Processing and Psycholinguistics:
computationally grounded semantic similarity datasets for Basque and Spanish
- URL: http://arxiv.org/abs/2304.09616v2
- Date: Thu, 20 Apr 2023 08:23:21 GMT
- Authors: J. Goikoetxea, M. Arantzeta, I. San Martin
- Abstract summary: We present a word similarity dataset based on two well-known Natural Language Processing resources: text corpora and knowledge bases.
The present dataset includes noun pairs' information in Basque and European Spanish, but further work intends to extend it to more languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a computationally grounded word similarity dataset based on two
well-known Natural Language Processing resources: text corpora and knowledge
bases. This dataset aims to fill a gap in psycholinguistic research by
providing a variety of quantifications of semantic similarity for an extensive
set of noun pairs controlled for variables that play a significant role in
lexical processing. The dataset was created in three steps: 1)
computing four key psycholinguistic features for each noun (concreteness,
frequency, semantic neighbourhood density and phonological neighbourhood density); 2) pairing nouns
across these four variables; 3) for each noun pair, assigning three types of
word similarity measurements, computed from text, WordNet and hybrid
embeddings. The present dataset includes noun-pair information for Basque and
European Spanish, and further work intends to extend it to more languages.
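As a rough illustration of step 3, the sketch below scores one noun pair in three embedding spaces. It assumes cosine similarity as the measure, which the abstract does not state. The function names, the toy 3-dimensional vectors and the Basque example pair are hypothetical and are not taken from the dataset.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for illustration only.
# Real text, WordNet and hybrid embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "text": {"etxe": [1.0, 0.0, 1.0], "eraikin": [1.0, 1.0, 0.0]},
    "wordnet": {"etxe": [0.0, 2.0, 1.0], "eraikin": [0.0, 1.0, 2.0]},
    "hybrid": {"etxe": [1.0, 2.0, 2.0], "eraikin": [1.0, 2.0, 2.0]},
}


def pair_similarities(word_a, word_b, spaces):
    """Return one similarity score per embedding space for a noun pair."""
    return {
        name: cosine_similarity(vectors[word_a], vectors[word_b])
        for name, vectors in spaces.items()
    }


if __name__ == "__main__":
    for space, score in pair_similarities("etxe", "eraikin", EMBEDDINGS).items():
        print(f"{space}: {score:.3f}")
```

Keeping one score per space, rather than averaging them, mirrors the abstract's description of three separate similarity measurements per pair.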
Related papers
- UniPSDA: Unsupervised Pseudo Semantic Data Augmentation for Zero-Shot Cross-Lingual Natural Language Understanding [31.272603877215733]
Cross-lingual representation learning transfers knowledge from resource-rich data to resource-scarce ones to improve the semantic understanding abilities of different languages.
We propose an Unsupervised Pseudo Semantic Data Augmentation (UniPSDA) mechanism for cross-lingual natural language understanding to enrich the training data without human interventions.
(2024-06-24T07:27:01Z)
- Domain Embeddings for Generating Complex Descriptions of Concepts in Italian Language [65.268245109828]
We propose a Distributional Semantic resource enriched with linguistic and lexical information extracted from electronic dictionaries.
The resource comprises 21 domain-specific matrices, one comprehensive matrix, and a Graphical User Interface.
Our model facilitates the generation of reasoned semantic descriptions of concepts by selecting matrices directly associated with concrete conceptual knowledge.
(2024-02-26T15:04:35Z)
- SpaDeLeF: A Dataset for Hierarchical Classification of Lexical Functions for Collocations in Spanish [6.9454683800956705]
We present a dataset of most frequent Spanish verb-noun collocations and sentences where they occur.
Each collocation is assigned to one of 37 lexical functions defined as classes for a hierarchical classification task.
We combine the classes in a tree-based structure, and introduce classification objectives for each level of the structure.
(2023-11-07T18:32:34Z)
- Agentività e telicità in GilBERTo: implicazioni cognitive [77.71680953280436]
The goal of this study is to investigate whether a Transformer-based neural language model infers lexical semantics.
The semantic properties considered are telicity (also combined with definiteness) and agentivity.
(2023-07-06T10:52:22Z)
- SimRelUz: Similarity and Relatedness scores as a Semantic Evaluation dataset for Uzbek language [0.0]
We present a semantic model evaluation dataset: SimRelUz.
The dataset consists of more than a thousand pairs of words carefully selected based on their morphological features.
We also paid attention to the problem of dealing with rare words and out-of-vocabulary words.
(2022-05-12T13:11:28Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
(2021-09-13T21:05:37Z)
- Decomposing lexical and compositional syntax and semantics with deep language models [82.81964713263483]
The activations of language transformers like GPT2 have been shown to linearly map onto brain activity during speech comprehension.
Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four classes: lexical, compositional, syntactic, and semantic representations.
The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices.
(2021-03-02T10:24:05Z)
- Multilingual Irony Detection with Dependency Syntax and Neural Models [61.32653485523036]
It focuses on the contribution from syntactic knowledge, exploiting linguistic resources where syntax is annotated according to the Universal Dependencies scheme.
The results suggest that fine-grained dependency-based syntactic information is informative for the detection of irony.
(2020-11-11T11:22:05Z)
- BabelEnconding at SemEval-2020 Task 3: Contextual Similarity as a Combination of Multilingualism and Language Models [0.5276232626689568]
This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity.
(2020-08-19T13:46:37Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
(2020-03-10T17:17:01Z)