Evaluating Word Embeddings with Categorical Modularity
- URL: http://arxiv.org/abs/2106.00877v1
- Date: Wed, 2 Jun 2021 01:29:11 GMT
- Title: Evaluating Word Embeddings with Categorical Modularity
- Authors: Sílvia Casacuberta, Karina Halevy, Damián E. Blasi
- Abstract summary: We introduce categorical modularity, a novel low-resource metric to evaluate word embedding quality.
We use a core set of 500 words belonging to 59 neurobiologically motivated semantic categories in 29 languages.
We find moderate to strong positive correlations between categorical modularity and performance on the monolingual tasks of sentiment analysis and word similarity calculation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce categorical modularity, a novel low-resource intrinsic metric to
evaluate word embedding quality. Categorical modularity is a graph modularity
metric based on the $k$-nearest neighbor graph constructed with embedding
vectors of words from a fixed set of semantic categories, in which the goal is
to measure the proportion of words that have nearest neighbors within the same
categories. We use a core set of 500 words belonging to 59 neurobiologically
motivated semantic categories in 29 languages and analyze three word embedding
models per language (FastText, MUSE, and subs2vec). We find moderate to strong
positive correlations between categorical modularity and performance on the
monolingual tasks of sentiment analysis and word similarity calculation and on
the cross-lingual task of bilingual lexicon induction both to and from English.
Overall, we suggest that categorical modularity provides non-trivial predictive
information about downstream task performance, with breakdowns of correlations
by model suggesting some meta-predictive properties about semantic information
loss as well.
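As a concrete illustration of the metric described in the abstract, the following is a minimal sketch of how categorical modularity could be computed: build a $k$-nearest neighbor graph over the embedding vectors of the category words, then evaluate graph modularity with the fixed semantic categories as communities. The function name, the cosine-similarity neighbor definition, and the use of networkx are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np
import networkx as nx


def categorical_modularity(embeddings, categories, k=3):
    """embeddings: (n_words, dim) array; categories: one category id per word."""
    # Cosine similarity between all pairs of word vectors.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a word is never its own neighbor

    # k-nearest-neighbor edges, collapsed into an undirected graph.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(categories)))
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-k:]:
            graph.add_edge(i, int(j))

    # Graph modularity with the fixed semantic categories as communities.
    communities = {}
    for node, cat in enumerate(categories):
        communities.setdefault(cat, set()).add(node)
    return nx.algorithms.community.modularity(graph, communities.values())
```

Run over the embeddings of the 500 core words for a given model, a higher value would indicate that more nearest neighbors fall inside the same semantic category; the paper correlates such scores with downstream task performance.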
Related papers
- Investigating Idiomaticity in Word Representations [9.208145117062339]
We focus on noun compounds of varying levels of idiomaticity in two languages (English and Portuguese).
We present a dataset of minimal pairs containing human idiomaticity judgments for each noun compound at both type and token levels.
We define a set of fine-grained metrics of Affinity and Scaled Similarity to determine how sensitive the models are to perturbations that may lead to changes in idiomaticity.
arXiv Detail & Related papers (2024-11-04T21:05:01Z)
- Domain Embeddings for Generating Complex Descriptions of Concepts in Italian Language [65.268245109828]
We propose a Distributional Semantic resource enriched with linguistic and lexical information extracted from electronic dictionaries.
The resource comprises 21 domain-specific matrices, one comprehensive matrix, and a Graphical User Interface.
Our model facilitates the generation of reasoned semantic descriptions of concepts by selecting matrices directly associated with concrete conceptual knowledge.
arXiv Detail & Related papers (2024-02-26T15:04:35Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests that LMs may serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- A Comprehensive Empirical Evaluation of Existing Word Embedding Approaches [5.065947993017158]
We present the characteristics of existing word embedding approaches and analyze them with regard to many classification tasks.
Traditional approaches mostly use matrix factorization to produce word representations, and they are not able to capture the semantic and syntactic regularities of the language very well.
On the other hand, Neural-network-based approaches can capture sophisticated regularities of the language and preserve the word relationships in the generated word representations.
arXiv Detail & Related papers (2023-03-13T15:34:19Z)
- Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding [143.5927158318524]
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We argue that the inherent structured semantics inside the videos and language is the crucial factor to achieve compositional generalization.
arXiv Detail & Related papers (2023-01-22T08:02:23Z)
- Compositional Generalization in Grounded Language Learning via Induced Model Sparsity [81.38804205212425]
We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations.
We design an agent that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal.
Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations.
arXiv Detail & Related papers (2022-07-06T08:46:27Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study provides an assessment of existing language models for distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Decomposing lexical and compositional syntax and semantics with deep language models [82.81964713263483]
The activations of language transformers like GPT2 have been shown to linearly map onto brain activity during speech comprehension.
Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four classes: lexical, compositional, syntactic, and semantic representations.
The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices.
arXiv Detail & Related papers (2021-03-02T10:24:05Z)
- Enhanced word embeddings using multi-semantic representation through lexical chains [1.8199326045904998]
We propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II.
These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings as building blocks forming a single system.
Our results show that the integration of lexical chains and word embedding representations sustains state-of-the-art results, even against more complex systems.
arXiv Detail & Related papers (2021-01-22T09:43:33Z)
- On the Importance of Word and Sentence Representation Learning in Implicit Discourse Relation Classification [43.483855615908695]
Implicit discourse relation classification is one of the most difficult parts in shallow discourse parsing.
We argue that a powerful contextualized representation module, a bilateral multi-perspective matching module, and a global information fusion module are all important to implicit discourse analysis.
arXiv Detail & Related papers (2020-04-27T07:41:02Z)