Representing Mixtures of Word Embeddings with Mixtures of Topic
Embeddings
- URL: http://arxiv.org/abs/2203.01570v1
- Date: Thu, 3 Mar 2022 08:46:23 GMT
- Title: Representing Mixtures of Word Embeddings with Mixtures of Topic
Embeddings
- Authors: Dongsheng Wang, Dandan Guo, He Zhao, Huangjie Zheng, Korawat
Tanwisuth, Bo Chen, Mingyuan Zhou
- Abstract summary: A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions.
This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space.
Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of a document's words and those of the topics, and optimize the topic embeddings to minimize the expected difference over all documents.
- Score: 46.324584649014284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A topic model is often formulated as a generative model that explains how
each word of a document is generated given a set of topics and
document-specific topic proportions. It focuses on capturing the word
co-occurrences in a document and hence often performs poorly in
analyzing short documents. In addition, its parameter estimation often relies
on approximate posterior inference that is either not scalable or suffers from
large approximation error. This paper introduces a new topic-modeling framework
where each document is viewed as a set of word embedding vectors and each topic
is modeled as an embedding vector in the same embedding space. Embedding the
words and topics in the same vector space, we define a method to measure the
semantic difference between the embedding vectors of the words of a document
and those of the topics, and optimize the topic embeddings to minimize the
expected difference over all documents. Experiments on text analysis
demonstrate that the proposed method, which is amenable to optimization by
mini-batch stochastic gradient descent and hence scalable to large
corpora, provides competitive performance in discovering more coherent and
diverse topics and extracting better document representations.
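Read as an optimization problem, the framework above is compact enough to sketch. The following is a minimal PyTorch illustration, assuming a soft-min word-to-topic distance as the per-document "semantic difference"; the paper's exact cost (and any transport-style machinery) may differ, and all names here are illustrative.

```python
import torch

def doc_to_topic_cost(word_vecs, topic_vecs, tau=0.1):
    """word_vecs: (n_words, d) embeddings of one document's words;
    topic_vecs: (n_topics, d) learnable topic embeddings."""
    dists = torch.cdist(word_vecs, topic_vecs)        # (n_words, n_topics)
    weights = torch.softmax(-dists / tau, dim=1)      # soft word-to-topic assignment
    return (weights * dists).sum(dim=1).mean()        # expected distance per word

d, n_topics = 300, 50
topic_vecs = torch.randn(n_topics, d, requires_grad=True)
opt = torch.optim.Adam([topic_vecs], lr=1e-2)

# Toy mini-batches of "documents" (random word embeddings) for illustration.
minibatches = [[torch.randn(torch.randint(5, 40, (1,)).item(), d) for _ in range(8)]
               for _ in range(20)]

for docs in minibatches:
    # The objective is an average over documents, so mini-batching is natural.
    loss = torch.stack([doc_to_topic_cost(doc, topic_vecs) for doc in docs]).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the loss averages over documents, mini-batch SGD applies directly, which is what makes the approach scalable to large corpora.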
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST (Corpus-Aware Self-similarity Enhanced Topic modelling), a novel topic-modelling method.
We find self-similarity to be an effective metric for preventing functional words from acting as candidate topic words (a minimal sketch of such a filter follows this entry).
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
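A minimal sketch of a self-similarity filter in the spirit of CAST, assuming the common definition (mean pairwise cosine similarity of a word type's contextualized embeddings across contexts); the threshold value and its direction are assumptions, not taken from the paper.

```python
import numpy as np

def self_similarity(contextual_vecs):
    """Mean pairwise cosine similarity of one word type's contextualized
    embeddings gathered from different contexts; contextual_vecs: (n, d)."""
    x = contextual_vecs / np.linalg.norm(contextual_vecs, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))  # off-diagonal mean

def filter_candidates(word_occurrences, threshold=0.4):
    """Keep words clearing the threshold as candidate topic words; the
    threshold and its direction are illustrative assumptions."""
    return [w for w, vecs in word_occurrences.items()
            if len(vecs) > 1 and self_similarity(np.asarray(vecs)) >= threshold]
```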
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates document neighbors into the intra-batch contextual loss (a sketch follows this entry).
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
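One reading of the contrastive method is an InfoNCE-style loss computed over batches assembled from neighboring documents, so that the in-batch negatives are hard ones. A hedged sketch, with the function name and the batching policy as assumptions:

```python
import torch
import torch.nn.functional as F

def neighbor_aware_info_nce(query_embs, doc_embs, temperature=0.05):
    """query_embs, doc_embs: (batch, d). Row i of each forms a positive pair;
    the remaining rows act as in-batch negatives. If the batch is built from
    neighboring documents, those negatives are hard."""
    q = F.normalize(query_embs, dim=1)
    p = F.normalize(doc_embs, dim=1)
    logits = q @ p.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```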
- Improving Contextualized Topic Models with Negative Sampling [3.708656266586146]
We propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics.
In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document; a sketch of this triplet objective follows the entry.
arXiv Detail & Related papers (2023-03-27T07:28:46Z)
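A minimal sketch of such a triplet objective, assuming a Gaussian perturbation of the document-topic vector and a squared-error reconstruction distance; the paper's decoder and sampling scheme may differ.

```python
import torch
import torch.nn.functional as F

def triplet_reconstruction_loss(x_bow, theta, decoder, margin=1.0, noise=0.5):
    """x_bow: (batch, vocab) bag-of-words input; theta: (batch, n_topics)
    document-topic vectors; decoder: callable mapping theta -> (batch, vocab).
    The perturbation scheme and distance are illustrative assumptions."""
    theta_neg = F.softmax(theta + noise * torch.randn_like(theta), dim=1)
    pos = decoder(theta)        # reconstruction from the correct vector
    neg = decoder(theta_neg)    # reconstruction from the perturbed vector
    d_pos = (x_bow - pos).pow(2).sum(dim=1)
    d_neg = (x_bow - neg).pow(2).sum(dim=1)
    # Push the correct reconstruction closer to the input than the negative one.
    return F.relu(d_pos - d_neg + margin).mean()
```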
- HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use the document-level context to improve translation quality.
Irrelevant or trivial words, however, may introduce noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context (a simplified gating sketch follows this entry).
arXiv Detail & Related papers (2023-01-17T12:07:13Z)
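A simplified sketch of a layer-wise selection mechanism: the paper describes hard selection of context tokens, while the version below uses a soft, differentiable gate as a stand-in, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class SelectiveContextLayer(nn.Module):
    """One encoder layer followed by a soft gate over context tokens;
    stacking such layers progressively sifts the auxiliary context."""

    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x, context_mask):
        """x: (batch, seq, d); context_mask: (batch, seq) bool, True where a
        token belongs to the auxiliary context and may be down-weighted."""
        h = self.layer(x)
        g = torch.sigmoid(self.gate(h)).squeeze(-1)     # per-token keep score
        # Current-sentence tokens pass through unchanged; context tokens are scaled.
        scale = torch.where(context_mask, g, torch.ones_like(g))
        return h * scale.unsqueeze(-1)
```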
- Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity [11.157086694203201]
We present a new scientific document similarity model based on matching fine-grained aspects.
Our model is trained using co-citation contexts that describe related paper aspects as a novel form of textual supervision.
arXiv Detail & Related papers (2021-11-16T11:12:30Z)
- TopicNet: Semantic Graph-Guided Topic Discovery [51.71374479354178]
Existing deep hierarchical topic models are able to extract semantically meaningful topics from a text corpus in an unsupervised manner.
We introduce TopicNet as a deep hierarchical topic model that can inject prior structural knowledge as an inductive bias to influence learning.
arXiv Detail & Related papers (2021-10-27T09:07:14Z)
- Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network [49.458250193768826]
We propose sawtooth factorial topic embedding guided GBN, a deep generative model of documents.
Both the words and topics are represented as embedding vectors of the same dimension.
Our models outperform other neural topic models at extracting deeper interpretable topics.
arXiv Detail & Related papers (2021-06-30T10:14:57Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short texts.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Topic Scaling: A Joint Document Scaling -- Topic Model Approach To Learn Time-Specific Topics [0.0]
This paper proposes a new methodology to study sequential corpora by implementing a two-stage algorithm that learns time-based topics with respect to a scale of document positions.
The first stage ranks documents using Wordfish to estimate document positions that serve as a dependent variable to learn relevant topics.
The second stage ranks the inferred topics on the document scale to match their occurrences within the corpus and track their evolution.
arXiv Detail & Related papers (2021-03-31T12:35:36Z)
- Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! [5.819224524813161]
We propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words.
The best-performing combination of our approach performs as well as classical topic models, but with lower runtime and computational complexity (a minimal clustering sketch follows this entry).
arXiv Detail & Related papers (2020-04-30T16:18:18Z)
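A minimal sketch of the clustering route to topics, assuming plain k-means over pre-trained word embeddings; the paper's document-informed weighting and top-word reranking are omitted here, and the function name is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def embedding_topics(vocab, vecs, n_topics=20, top_n=10):
    """vocab: list of words; vecs: (len(vocab), d) pre-trained embeddings.
    Cluster the embeddings and read each cluster's nearest words as a topic."""
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(vecs)
    topics = []
    for c in range(n_topics):
        dists = np.linalg.norm(vecs - km.cluster_centers_[c], axis=1)
        topics.append([vocab[i] for i in np.argsort(dists)[:top_n]])
    return topics
```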