Extending Multi-Sense Word Embedding to Phrases and Sentences for
Unsupervised Semantic Applications
- URL: http://arxiv.org/abs/2103.15330v1
- Date: Mon, 29 Mar 2021 04:54:28 GMT
- Title: Extending Multi-Sense Word Embedding to Phrases and Sentences for
Unsupervised Semantic Applications
- Authors: Haw-Shiuan Chang, Amol Agrawal, Andrew McCallum
- Abstract summary: We propose a novel embedding method for a text sequence (a phrase or a sentence) where each sequence is represented by a distinct set of codebook embeddings.
Our experiments show that the per-sentence codebook embeddings significantly improve performance on unsupervised sentence similarity and extractive summarization benchmarks.
- Score: 34.71597411512625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most unsupervised NLP models represent each word with a single point or
single region in semantic space, while existing multi-sense word embeddings
cannot represent longer word sequences like phrases or sentences. We propose a
novel embedding method for a text sequence (a phrase or a sentence) where each
sequence is represented by a distinct set of multi-mode codebook embeddings to
capture different semantic facets of its meaning. The codebook embeddings can
be viewed as the cluster centers which summarize the distribution of possibly
co-occurring words in a pre-trained word embedding space. We introduce an
end-to-end trainable neural model that directly predicts the set of cluster
centers from the input text sequence during test time. Our experiments show
that the per-sentence codebook embeddings significantly improve performance
on unsupervised sentence similarity and extractive summarization
benchmarks. In phrase similarity experiments, we discover that the multi-facet
embeddings provide an interpretable semantic representation but do not
outperform the single-facet baseline.
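To make the approach concrete, here is a minimal sketch in Python: the codebook embeddings are approximated by k-means cluster centers over pre-trained word vectors (whereas the paper trains a neural model to predict the centers directly from the input sequence), and sentence similarity is scored by a simple set-to-set matching rule. The vocabulary, vectors, and matching rule are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the paper *predicts* the codebook embeddings with
# a trained neural model; here we approximate them as k-means cluster centers
# over pre-trained word vectors. Vocabulary and matching rule are toy choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical pre-trained word embedding table (5 toy words, dim 50).
word_vec = {w: rng.normal(size=50)
            for w in ["bank", "river", "money", "water", "loan"]}

def facet_embeddings(tokens, k=3):
    """Return k codebook vectors (cluster centers) for a token sequence."""
    X = np.stack([word_vec[t] for t in tokens if t in word_vec])
    k = min(k, len(X))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

def facet_similarity(A, B):
    """Set-to-set similarity: mean best cosine match, averaged both ways."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    S = A @ B.T
    return 0.5 * (S.max(axis=1).mean() + S.max(axis=0).mean())

s1 = facet_embeddings(["bank", "river", "water"])
s2 = facet_embeddings(["bank", "money", "loan"])
print(f"multi-facet similarity: {facet_similarity(s1, s2):.3f}")
```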
Related papers
- Multi hash embeddings in spaCy [1.6790532021482656]
spaCy is an industrial-strength natural language processing library whose default embedding layer is a hash embeddings layer.
This technical report lays out some history and introduces the embedding methods in spaCy in detail.
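As a rough illustration of the hash-embeddings idea (not spaCy's actual implementation), the sketch below hashes each word several times into one small shared table and sums the selected rows, so an unbounded vocabulary fits a fixed parameter budget; the table size, dimension, and hashing scheme are arbitrary choices.

```python
# Sketch of the hash-embeddings idea (not spaCy's actual implementation):
# hash each word several times into one small shared table and sum the rows.
import hashlib
import numpy as np

TABLE_ROWS, DIM, NUM_HASHES = 1000, 64, 4
table = np.random.default_rng(0).normal(size=(TABLE_ROWS, DIM))

def bucket(word: str, seed: int) -> int:
    """Deterministic seeded hash of a word into the table."""
    return int(hashlib.md5(f"{seed}:{word}".encode()).hexdigest(), 16) % TABLE_ROWS

def hash_embed(word: str) -> np.ndarray:
    # One row per hash seed; two words rarely collide on *all* seeds at once,
    # so their summed vectors stay distinct despite the tiny table.
    return sum(table[bucket(word, s)] for s in range(NUM_HASHES))

print(hash_embed("apple").shape)  # (64,)
```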
arXiv Detail & Related papers (2022-12-19T06:03:04Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks such as text editing and semantic similarity evaluation.
This paper addresses the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show that the proposed neighboring distribution divergence (NDD) is more sensitive to various semantic differences, especially on highly overlapped paired texts.
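A simplified sketch of the mask-and-predict step, using the HuggingFace transformers API: mask a shared word in each text and compare the MLM distributions predicted at its position. This only illustrates the general recipe, not the paper's exact NDD computation; the model and example sentences are arbitrary.

```python
# Simplified mask-and-predict step with HuggingFace transformers (a toy
# illustration, not the paper's exact NDD computation).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def masked_distribution(text: str, target: str) -> torch.Tensor:
    """MLM distribution at the position of `target`, with `target` masked."""
    ids = tok(text, return_tensors="pt")["input_ids"]
    pos = (ids[0] == tok.convert_tokens_to_ids(target)).nonzero()[0, 0]
    ids[0, pos] = tok.mask_token_id
    with torch.no_grad():
        logits = mlm(input_ids=ids).logits[0, pos]
    return torch.softmax(logits, dim=-1)

p = masked_distribution("the movie was great", "great")
q = masked_distribution("the movie was not great", "great")
# KL divergence between the distributions predicted at the shared word.
print(f"divergence at 'great': {torch.sum(p * (p.log() - q.log())).item():.4f}")
```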
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Clustering and Network Analysis for the Embedding Spaces of Sentences
and Sub-Sentences [69.3939291118954]
This paper reports research on a set of comprehensive clustering and network analyses targeting sentence and sub-sentence embedding spaces.
Results show that one of the evaluated methods generates markedly more clusterable embeddings than the others.
In general, the embeddings of sub-sentence spans have better clustering properties than those of the original sentences.
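One way such a clusterability analysis can be run, sketched here with synthetic stand-in embeddings (the paper's actual embedding models and metrics may differ): cluster the vectors with k-means and score the result with the silhouette coefficient.

```python
# Toy clusterability check: cluster stand-in sentence embeddings with k-means
# and score with the silhouette coefficient. Data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Stand-in for sentence embeddings: three loose groups in 2-D.
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(30, 2)) for c in (0, 3, 6)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Closer to 1 means more clusterable.
print(f"silhouette: {silhouette_score(X, labels):.3f}")
```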
arXiv Detail & Related papers (2021-10-02T00:47:35Z) - SemGloVe: Semantic Co-occurrences for GloVe from BERT [55.420035541274444]
GloVe learns word embeddings by leveraging statistical information from word co-occurrence matrices.
We propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings.
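A heavily simplified illustration of reading a "semantic co-occurrence" strength from BERT: mask one word of a pair and take the MLM probability of the other word at that position, which could stand in for window-based counts when fitting GloVe. SemGloVe's exact definitions differ; the sentence and word pair below are toy inputs.

```python
# Toy version of a BERT-derived co-occurrence weight: mask word_i and read
# the MLM probability of word_j at that position. SemGloVe's actual
# definitions differ; sentence and word pair are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def cooccurrence_weight(sentence: str, word_i: str, word_j: str) -> float:
    ids = tok(sentence, return_tensors="pt")["input_ids"]
    pos = (ids[0] == tok.convert_tokens_to_ids(word_i)).nonzero()[0, 0]
    ids[0, pos] = tok.mask_token_id  # hide word_i
    with torch.no_grad():
        probs = torch.softmax(mlm(input_ids=ids).logits[0, pos], dim=-1)
    return probs[tok.convert_tokens_to_ids(word_j)].item()

print(cooccurrence_weight("the ice melted in the sun", "ice", "water"))
```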
arXiv Detail & Related papers (2020-12-30T15:38:26Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method that explicitly enhances conventional word embeddings with multi-aspect senses derived from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
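A toy sketch of the general recipe: aggregate the image features of a word's related images and fuse them with its text embedding. The shapes and the pool-concatenate-project fusion are our assumptions, not the paper's architecture.

```python
# Toy fusion of a word vector with pooled image features from a word-image
# dictionary. Shapes and the pool-concatenate-project scheme are assumptions.
import numpy as np

rng = np.random.default_rng(2)
text_vec = rng.normal(size=300)                   # pre-trained word embedding
image_feats = rng.normal(size=(5, 2048))          # CNN features of 5 related images

visual_vec = image_feats.mean(axis=0)             # pool the word's image set
fused = np.concatenate([text_vec, visual_vec])    # (2348,)
W = rng.normal(size=(300, 2348)) / np.sqrt(2348)  # hypothetical learned projection
grounded = W @ fused                              # visually grounded embedding
print(grounded.shape)                             # (300,)
```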
arXiv Detail & Related papers (2020-12-30T09:11:50Z) - A Correspondence Variational Autoencoder for Unsupervised Acoustic Word
Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
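The embedding step alone can be sketched as follows: a recurrent encoder maps a variable-length sequence of acoustic frames (e.g., MFCCs) to one fixed-dimensional vector. The paper's correspondence-VAE training objective is omitted, and all sizes are illustrative.

```python
# Embedding step only: a GRU maps a variable-length sequence of acoustic
# frames (e.g., MFCCs) to a fixed-dimensional vector. The correspondence-VAE
# training objective is omitted; sizes are illustrative.
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, n_mfcc=13, dim=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mfcc, hidden_size=dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        _, h_n = self.rnn(frames)   # h_n: (1, batch, dim)
        return h_n[-1]              # the fixed-dimensional embedding

enc = AcousticEncoder()
short = enc(torch.randn(1, 40, 13))  # 40-frame segment
long_ = enc(torch.randn(1, 90, 13))  # 90-frame segment
print(short.shape, long_.shape)      # both torch.Size([1, 128])
```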
arXiv Detail & Related papers (2020-12-03T19:24:42Z) - PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding [16.531103175919924]
We look into the task of generalizing word embeddings:
given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words.
We propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embedding.
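A simplified bag-of-subwords sketch for an out-of-vocabulary word: average the embeddings of its character n-grams, fastText-style. PBoS additionally weights whole segmentations by their probability, which this toy version omits; the subword table here is random rather than trained.

```python
# Simplified bag-of-subwords for an OOV word: average its character n-gram
# embeddings (fastText-style). PBoS additionally weights segmentations
# probabilistically; the subword table here is random, not trained.
import numpy as np

rng = np.random.default_rng(3)
DIM = 50
subword_vec = {}  # stand-in for a trained subword embedding table

def get_subword_vec(sub: str) -> np.ndarray:
    if sub not in subword_vec:
        subword_vec[sub] = rng.normal(size=DIM)  # toy: random, not trained
    return subword_vec[sub]

def oov_embedding(word: str, n_min: int = 3, n_max: int = 5) -> np.ndarray:
    padded = f"<{word}>"  # boundary markers, as in fastText
    grams = [padded[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(padded) - n + 1)]
    return np.mean([get_subword_vec(g) for g in grams], axis=0)

print(oov_embedding("unfoldable").shape)  # (50,)
```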
arXiv Detail & Related papers (2020-10-21T08:11:08Z) - Unsupervised Summarization by Jointly Extracting Sentences and Keywords [12.387378783627762]
RepRank is an unsupervised graph-based ranking model for extractive multi-document summarization.
We show that salient sentences and keywords can be extracted in a joint and mutual reinforcement process using our learned representations.
Experimental results on multiple benchmark datasets show that RepRank achieves the best or comparable performance in ROUGE.
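A generic graph-based ranking sketch in the spirit of RepRank (and TextRank before it): build a cosine-similarity graph over sentence vectors and run PageRank-style power iteration. RepRank's learned representations and its joint keyword extraction are not reproduced; the embeddings below are random stand-ins.

```python
# Generic graph-based sentence ranking (TextRank-style): cosine-similarity
# graph over sentence vectors plus PageRank power iteration. RepRank's learned
# representations and joint keyword extraction are not reproduced.
import numpy as np

rng = np.random.default_rng(4)
S = rng.normal(loc=0.5, size=(6, 50))           # stand-in sentence embeddings
S /= np.linalg.norm(S, axis=1, keepdims=True)

A = np.clip(S @ S.T, 0, None)                   # non-negative similarities
np.fill_diagonal(A, 0.0)
P = A / (A.sum(axis=1, keepdims=True) + 1e-12)  # row-stochastic transitions

r, d = np.full(6, 1 / 6), 0.85                  # uniform start, damping factor
for _ in range(50):
    r = (1 - d) / 6 + d * (P.T @ r)             # PageRank power iteration

print("sentences by salience:", np.argsort(-r))
```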
arXiv Detail & Related papers (2020-09-16T05:58:00Z) - Supervised Understanding of Word Embeddings [1.160208922584163]
We obtain supervised projections in the form of linear keyword-level classifiers on word embeddings.
We show that the method creates interpretable projections of the original embedding dimensions.
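A minimal sketch of such a keyword-level probe on toy data: fit logistic regression on word vectors for a binary keyword label, and read the learned weight vector as an interpretable direction in the embedding space. The data and labels below are synthetic.

```python
# Keyword-level linear probe on toy data: logistic regression on word vectors
# for a binary label; the weight vector is an interpretable direction
# (projection) in the embedding space.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
DIM = 50
axis = rng.normal(size=DIM)                # hidden "keyword" direction
X_pos = rng.normal(size=(40, DIM)) + axis  # words matching the keyword
X_neg = rng.normal(size=(40, DIM)) - axis  # words not matching it
X, y = np.vstack([X_pos, X_neg]), np.array([1] * 40 + [0] * 40)

clf = LogisticRegression(max_iter=1000).fit(X, y)
w = clf.coef_[0]   # interpretable projection axis
scores = X @ w     # project any word vector onto it
print(f"train accuracy: {clf.score(X, y):.2f}")
```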
arXiv Detail & Related papers (2020-06-23T20:13:42Z) - Neural Syntactic Preordering for Controlled Paraphrase Generation [57.5316011554622]
Our work uses syntactic transformations to softly "reorder" the source sentence and guide our neural paraphrasing model.
First, given an input sentence, we derive a set of feasible syntactic rearrangements using an encoder-decoder model.
Next, we use each proposed rearrangement to produce a sequence of position embeddings, which encourages our final encoder-decoder paraphrase model to attend to the source words in a particular order.
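A sketch of the position-embedding trick: each source token receives the embedding of the position it would occupy after the proposed reordering, added to its token embedding. The shapes and the toy rearrangement are illustrative; the paper uses full encoder-decoder models.

```python
# Position-embedding trick, sketched: each source token gets the embedding of
# the position it would occupy after the proposed reordering, added to its
# token embedding. Shapes and the toy rearrangement are illustrative.
import torch
import torch.nn as nn

vocab, dim, max_len = 100, 32, 20
tok_emb = nn.Embedding(vocab, dim)
pos_emb = nn.Embedding(max_len, dim)

tokens = torch.tensor([[5, 9, 2, 7]])   # source sentence (token ids)
reorder = torch.tensor([[2, 0, 3, 1]])  # target position of each source token

x = tok_emb(tokens) + pos_emb(reorder)  # order-guided encoder input
print(x.shape)                          # torch.Size([1, 4, 32])
```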
arXiv Detail & Related papers (2020-05-05T09:02:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all of its information) and is not responsible for any consequences of its use.