Compass-aligned Distributional Embeddings for Studying Semantic
Differences across Corpora
- URL: http://arxiv.org/abs/2004.06519v1
- Date: Mon, 13 Apr 2020 15:46:47 GMT
- Title: Compass-aligned Distributional Embeddings for Studying Semantic
Differences across Corpora
- Authors: Federico Bianchi and Valerio Di Carlo and Paolo Nicoli and Matteo
Palmonari
- Abstract summary: We present a framework to support cross-corpora language studies with word embeddings.
CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora.
The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available.
- Score: 14.993021283916008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word2vec is one of the most widely used algorithms for generating
word embeddings because of its good mix of efficiency, quality of the generated
representations, and cognitive grounding. However, word meaning is not static
and depends on the context in which words are used. Differences in word meaning
that depend on time, location, topic, and other factors can be studied by analyzing
embeddings generated from different corpora in collections that are
representative of these factors. For example, language evolution can be studied
using a collection of news articles published in different time periods. In
this paper, we present a general framework to support cross-corpora language
studies with word embeddings, where embeddings generated from different corpora
can be compared to find correspondences and differences in meaning across the
corpora. CADE is the core component of our framework and solves the key problem
of aligning the embeddings generated from different corpora. In particular, we
focus on providing solid evidence about the effectiveness, generality, and
robustness of CADE. To this end, we conduct quantitative and qualitative
experiments in different domains, from temporal word embeddings to language
localization and topical analysis. The results of our experiments suggest that
CADE achieves state-of-the-art or superior performance on tasks where several
competing approaches are available, while providing a general method that can be
used in a variety of domains. Finally, our experiments shed light on the
conditions under which the alignment is reliable, which substantially depends
on the degree of cross-corpora vocabulary overlap.
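Below is a minimal sketch of the comparison step the framework enables, assuming two slice embeddings that have already been aligned into a shared space (for example, by training each corpus slice against a common CADE compass); the vectors are toy placeholders, not real trained embeddings.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# word -> vector, one dict per corpus slice (hypothetical values).
slice_a = {"amazon": np.array([0.9, 0.1, 0.0]),
           "river": np.array([0.8, 0.2, 0.1])}
slice_b = {"amazon": np.array([0.1, 0.9, 0.2]),
           "river": np.array([0.7, 0.3, 0.1])}

# Because the slices share a compass, the same word can be compared
# directly across corpora: low similarity signals a meaning shift.
shared = sorted(set(slice_a) & set(slice_b))
shifts = {w: 1.0 - cosine(slice_a[w], slice_b[w]) for w in shared}
for word, shift in sorted(shifts.items(), key=lambda x: -x[1]):
    print(f"{word}: shift = {shift:.3f}")
```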
Related papers
- How Do Transformers Learn Topic Structure: Towards a Mechanistic
Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
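As a hedged illustration of what "encoding topical structure" can mean operationally (not the paper's own experiments), one can check that same-topic word vectors are more similar on average than cross-topic ones; the vectors below are random stand-ins for learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16
topics = {}
for name in ("sports", "finance"):
    centroid = rng.normal(size=dim)
    # Five hypothetical word vectors clustered around the topic centroid.
    topics[name] = centroid + 0.3 * rng.normal(size=(5, dim))

def mean_cos(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

intra = mean_cos(topics["sports"], topics["sports"])  # includes self-pairs
inter = mean_cos(topics["sports"], topics["finance"])
print(f"intra-topic {intra:.2f} vs inter-topic {inter:.2f}")
```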
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
Keywords and Instances: A Hierarchical Contrastive Learning Framework Unifying Hybrid Granularities for Text Generation [59.01297461453444]
We propose a hierarchical contrastive learning mechanism that can unify semantic meaning at hybrid granularities in the input text.
Experiments demonstrate that our model outperforms competitive baselines on paraphrasing, dialogue generation, and storytelling tasks.
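To make "contrastive learning" concrete, here is a generic InfoNCE-style loss sketch in PyTorch; it is only the standard building block, not the paper's hierarchical keyword/instance objective.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    """Each anchor is pulled toward its positive (same row) and pushed
    away from every other row, which acts as an in-batch negative."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.T / temperature    # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0))  # matching pairs on the diagonal
    return F.cross_entropy(logits, labels)

anchors = torch.randn(8, 32)    # e.g., keyword-level representations
positives = torch.randn(8, 32)  # e.g., instance-level representations
print(info_nce(anchors, positives).item())
```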
arXiv Detail & Related papers (2022-05-26T13:26:03Z)
Human-in-the-Loop Refinement of Word Embeddings [0.0]
We propose a system that incorporates an adaptation of word embedding post-processing, which we call "interactive refitting".
Our approach allows a human to identify and address potential quality issues with word embeddings interactively.
It also allows for better insight into what effect word embeddings, and refinements to word embeddings, have on machine learning pipelines.
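The summary leaves the refitting mechanics abstract; the toy sketch below shows one generic post-processing move in that spirit (nudging a human-flagged vector toward approved neighbors), not the paper's exact interactive-refitting procedure.

```python
import numpy as np

def refit(vectors, word, approved, step=0.5):
    """Move `word` toward the centroid of human-approved neighbors."""
    target = np.mean([vectors[w] for w in approved], axis=0)
    vectors[word] = (1 - step) * vectors[word] + step * target
    return vectors

vectors = {
    "cheap": np.array([1.0, 0.0]),
    "affordable": np.array([0.0, 1.0]),
    "inexpensive": np.array([0.1, 0.9]),
}
# A human flags "cheap" as lying too far from its positive synonyms.
vectors = refit(vectors, "cheap", ["affordable", "inexpensive"])
print(vectors["cheap"])
```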
arXiv Detail & Related papers (2021-10-06T16:10:32Z)
Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
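A minimal sketch of the mask-and-predict idea, using an off-the-shelf masked language model from Hugging Face transformers; the divergence below is a plain KL, not necessarily the exact NDD formulation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def mask_distribution(text, position):
    """Predicted token distribution at a masked position."""
    ids = tok(text, return_tensors="pt")["input_ids"]
    ids[0, position] = tok.mask_token_id
    with torch.no_grad():
        logits = mlm(input_ids=ids).logits
    return torch.softmax(logits[0, position], dim=-1)

# The shared word "movie" sits at position 2 in both sentences; the
# divergence between the predictions at that masked position reflects
# how the differing contexts shift the meaning.
p = mask_distribution("the movie was great", 2)
q = mask_distribution("the movie was awful", 2)
kl = torch.sum(p * (torch.log(p + 1e-12) - torch.log(q + 1e-12)))
print(f"KL divergence: {kl.item():.3f}")
```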
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
On the Impact of Knowledge-based Linguistic Annotations in the Quality of Scientific Embeddings [0.0]
We conduct a study on the use of explicit linguistic annotations to generate embeddings from a scientific corpus.
Our results show how the effect of such annotations on the embeddings varies depending on the evaluation task.
In general, we observe that learning embeddings using linguistic annotations contributes to achieving better evaluation results.
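A small sketch of the general recipe, under the assumption of a made-up lemma_POS annotation scheme: train embeddings over annotated tokens and compare them against plain-token embeddings.

```python
from gensim.models import Word2Vec

plain = [["cells", "divide", "rapidly"], ["the", "cell", "divided"]]
annotated = [["cell_NOUN", "divide_VERB", "rapidly_ADV"],
             ["the_DET", "cell_NOUN", "divide_VERB"]]

# min_count=1 because the toy corpus is tiny.
w2v_plain = Word2Vec(plain, vector_size=16, min_count=1, seed=1)
w2v_annot = Word2Vec(annotated, vector_size=16, min_count=1, seed=1)

# Annotation merges inflected forms ("divide"/"divided") into one
# lemma-level vector, which can change downstream evaluation results.
print(w2v_annot.wv.most_similar("cell_NOUN", topn=2))
```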
arXiv Detail & Related papers (2021-04-13T13:51:22Z)
EDS-MEMBED: Multi-sense embeddings based on enhanced distributional semantic structures via a graph walk over word senses [0.0]
We leverage the rich semantic structures in WordNet to enhance the quality of multi-sense embeddings.
We derive new distributional semantic similarity measures for multi-sense embeddings (M-SE) from prior ones.
We report evaluation results on 11 benchmark datasets involving WSD and Word Similarity tasks.
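As a simplified stand-in for the paper's graph-walk procedure, the NLTK snippet below takes one step through WordNet's sense graph to collect sense-specific neighbors.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

for synset in wn.synsets("bank")[:3]:
    # One-step walk: collect lemmas of the sense and of its hypernyms,
    # giving sense-specific distributional neighbors.
    neighbors = {l.name() for l in synset.lemmas()}
    for hyper in synset.hypernyms():
        neighbors |= {l.name() for l in hyper.lemmas()}
    print(synset.name(), sorted(neighbors))
```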
arXiv Detail & Related papers (2021-02-27T14:36:55Z)
Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
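The summary does not spell out the self-supervision mechanism; one plausible way to manufacture ground-truth shifts, sketched under that assumption, is to rewrite occurrences of a donor word as the target word so that the target acquires known, artificial new contexts in a perturbed corpus copy.

```python
import random

random.seed(0)

def inject_shift(sentences, target, donor, rate=0.5):
    """Rewrite some occurrences of `donor` as `target`, so that
    `target` absorbs the donor's contexts in the perturbed copy."""
    return [[target if tok == donor and random.random() < rate else tok
             for tok in sent] for sent in sentences]

corpus = [["the", "keyboard", "and", "mouse"], ["a", "keyboard", "shortcut"]]
# "mouse" now appears in "keyboard" contexts, a shift any alignment
# method can then be tested on.
print(inject_shift(corpus, "mouse", "keyboard", rate=1.0))
```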
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
Blind signal decomposition of various word embeddings based on join and individual variance explained [11.542392473831672]
We propose to use JIVE (Joint and Individual Variation Explained), a joint signal separation method, to decompose various trained word embeddings into joint and individual components.
We conducted an empirical study on word2vec, FastText, and GloVe embeddings trained on different corpora and with different dimensions.
We found that mapping different word embeddings into the joint component greatly improves sentiment performance for the embeddings that originally performed worse.
arXiv Detail & Related papers (2020-11-30T01:36:29Z)
Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
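A toy probe in the spirit of dimension selection (an illustration, not the paper's method): rank individual dimensions by how well each one alone separates a binary linguistic property.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 200, 20
labels = rng.integers(0, 2, size=n)       # synthetic "plural" property
reps = rng.normal(size=(n, dim))
reps[:, 7] += 2.0 * labels                # plant the property in dim 7

# Score each dimension by the gap between class means, in std units.
gaps = np.abs(reps[labels == 1].mean(0) - reps[labels == 0].mean(0))
scores = gaps / reps.std(0)
print("most informative dimensions:", np.argsort(scores)[::-1][:3])
```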
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
Comparative Analysis of Word Embeddings for Capturing Word Similarities [0.0]
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks.
Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings.
However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans.
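A minimal sketch of the standard comparison protocol such studies rely on: score each embedding space by the Spearman correlation between its cosine similarities and human similarity judgments (toy numbers below, not a real benchmark like WordSim-353).

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical human ratings for word pairs, on an arbitrary scale.
pairs = [("car", "automobile", 9.0),
         ("cup", "mug", 8.0),
         ("car", "banana", 1.0)]
embeddings = {"car": np.array([1.0, 0.2]),
              "automobile": np.array([0.9, 0.3]),
              "cup": np.array([0.4, 0.8]),
              "mug": np.array([0.5, 0.7]),
              "banana": np.array([-0.2, 1.0])}

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
print(spearmanr(model_scores, human_scores).correlation)
```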
arXiv Detail & Related papers (2020-05-08T01:16:03Z)