Efficient and Flexible Topic Modeling using Pretrained Embeddings and
Bag of Sentences
- URL: http://arxiv.org/abs/2302.03106v3
- Date: Sat, 10 Feb 2024 17:45:03 GMT
- Title: Efficient and Flexible Topic Modeling using Pretrained Embeddings and
Bag of Sentences
- Authors: Johannes Schneider
- Abstract summary: We propose a novel topic modeling and inference algorithm.
We leverage pre-trained sentence embeddings by combining generative process models and clustering.
The evaluation shows that our method yields state-of-the-art results with relatively low computational demands.
- Score: 1.8592384822257952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models have led to a new state-of-the-art in many NLP
tasks. However, for topic modeling, statistical generative models such as LDA
are still prevalent, which do not easily allow incorporating contextual word
vectors. They might yield topics that do not align well with human judgment. In
this work, we propose a novel topic modeling and inference algorithm. We
suggest a bag of sentences (BoS) approach using sentences as the unit of
analysis. We leverage pre-trained sentence embeddings by combining generative
process models and clustering. We derive a fast inference algorithm based on
expectation maximization, hard assignments, and an annealing process. The
evaluation shows that our method yields state-of-the-art results with
relatively modest computational demands. Our method is also more flexible
compared to prior works leveraging word embeddings, since it provides the
possibility to customize topic-document distributions using priors. Code and
data are at \url{https://github.com/JohnTailor/BertSenClu}.
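For illustration, here is a minimal sketch of the core idea, not the released BertSenClu code: sentences are embedded with a pre-trained encoder and topics are fit by hard-assignment EM over the embeddings with a simple annealing schedule. The encoder name, number of topics, and temperature schedule are placeholder assumptions.

```python
# Minimal sketch (not the official BertSenClu implementation): topics as centroids
# in sentence-embedding space, fit with hard-assignment EM plus simple annealing.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder choice

def fit_bos_topics(docs, n_topics=10, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    # Bag of sentences: every sentence keeps a pointer to its document.
    sents, doc_ids = [], []
    for d, doc in enumerate(docs):
        for s in doc.split("."):
            if s.strip():
                sents.append(s.strip())
                doc_ids.append(d)
    emb = encoder.encode(sents, normalize_embeddings=True)
    topics = emb[rng.choice(len(sents), n_topics, replace=False)]  # init centroids
    for it in range(n_iter):
        temperature = max(1.0 - it / n_iter, 1e-3)  # anneal toward hard assignments
        sim = emb @ topics.T
        if temperature > 0.05:
            p = np.exp(sim / temperature)
            p /= p.sum(axis=1, keepdims=True)
            assign = np.array([rng.choice(n_topics, p=row) for row in p])
        else:
            assign = sim.argmax(axis=1)      # E-step: hard assignment
        for k in range(n_topics):            # M-step: recompute topic centroids
            mask = assign == k
            if mask.any():
                topics[k] = emb[mask].mean(axis=0)
        topics /= np.linalg.norm(topics, axis=1, keepdims=True)
    return topics, assign, np.array(doc_ids)
```

Per-document topic distributions then follow from counting each document's sentence assignments; a prior could be added as pseudo-counts to customize them, in the spirit of the flexibility mentioned above.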
Related papers
- Topic Modeling with Fine-tuning LLMs and Bag of Sentences [1.8592384822257952]
FT-Topic is an unsupervised fine-tuning approach for topic modeling.
SenClu is a state-of-the-art topic modeling method that achieves fast inference and hard assignments of sentence groups to a single topic.
arXiv Detail & Related papers (2024-08-06T11:04:07Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs)
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
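As a toy illustration of the sampling idea described above (not the paper's exact procedure), the sketch below repeatedly re-masks a random position and resamples it from a masked LM's conditional distribution, i.e. a simple Gibbs-style Markov chain; the model name and step count are placeholders.

```python
# Toy Gibbs-style sampler from a masked LM: repeatedly re-mask one position and
# resample it from the model's conditional distribution (not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

ids = tok("the movie was surprisingly good overall", return_tensors="pt")["input_ids"]
for _ in range(50):                                         # Markov-chain steps
    pos = torch.randint(1, ids.shape[1] - 1, (1,)).item()   # skip [CLS]/[SEP]
    masked = ids.clone()
    masked[0, pos] = tok.mask_token_id
    with torch.no_grad():
        logits = model(masked).logits[0, pos]
    ids[0, pos] = torch.multinomial(logits.softmax(-1), 1).item()  # draw from conditional
print(tok.decode(ids[0], skip_special_tokens=True))
```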
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR)
Specifically, we propose to inject the standard Gaussian noise and regularize hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
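A simplified illustration of the idea, with a small MLP standing in for the fine-tuned language model: hidden representations are regularized to change little when Gaussian noise is injected at a layer input. The noise scale and loss weight are placeholders, not values from the paper.

```python
# Simplified noise-stability regularizer: hidden states should change little when
# Gaussian noise is injected at a layer's input (an MLP stands in for the LM here).
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)
        self.layer2 = nn.Linear(dim, dim)

    def forward(self, x, noise_std=0.0):
        h1 = torch.relu(self.layer1(x))
        if noise_std > 0:
            h1 = h1 + noise_std * torch.randn_like(h1)  # inject standard Gaussian noise
        return self.layer2(h1)

model = TinyEncoder()
x = torch.randn(8, 32)
clean = model(x)
noisy = model(x, noise_std=0.1)
task_loss = clean.pow(2).mean()                          # placeholder for the real task loss
stability_loss = nn.functional.mse_loss(noisy, clean.detach())  # noise-stability term
loss = task_loss + 1.0 * stability_loss                  # regularization weight is a placeholder
loss.backward()
```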
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
- Show Me How To Revise: Improving Lexically Constrained Sentence Generation with XLNet [27.567493727582736]
We propose a two-step approach, "Predict and Revise", for constrained sentence generation.
During the predict step, we leverage a classifier to compute a learned prior for the candidate sentence.
During the revise step, we use MCMC sampling to revise the candidate sentence by applying a sampled action at a position drawn from the learned prior.
Experimental results have demonstrated that our proposed model performs much better than the previous work in terms of sentence fluency and diversity.
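A hedged sketch of such a predict-and-revise loop: constraint words are kept fixed, an edit position is drawn (here uniformly, whereas the paper uses a learned prior), and a masked LM proposes a substitution. The fill-mask model and constraint words are illustrative; the paper itself builds on XLNet.

```python
# Sketch of "predict then revise" for lexically constrained generation: constraint
# tokens stay fixed; an edit position is drawn and refilled with a masked-LM proposal.
# Uniform position sampling and the BERT fill-mask model are placeholder choices.
import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
constraints = {"pizza", "delicious"}                       # words that must be kept
tokens = "the pizza tasted delicious tonight".split()      # candidate from the predict step

for _ in range(10):                                        # revise step (MCMC-style sweeps)
    editable = [i for i, t in enumerate(tokens) if t not in constraints]
    pos = random.choice(editable)                          # placeholder for the learned prior
    masked = " ".join(fill.tokenizer.mask_token if i == pos else t
                      for i, t in enumerate(tokens))
    proposal = fill(masked, top_k=1)[0]["token_str"].strip()
    tokens[pos] = proposal                                  # accept the sampled substitution
print(" ".join(tokens))
```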
arXiv Detail & Related papers (2021-09-13T09:21:07Z)
- A New Sentence Ordering Method Using BERT Pretrained Model [2.1793134762413433]
We propose a method for sentence ordering which does not need a training phase and consequently a large corpus for learning.
Our proposed method outperformed other baselines on ROCStories, a corpus of 5-sentence human-made stories.
Other advantages of this method are its interpretability and the fact that it requires no linguistic knowledge.
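The summary does not spell out the scoring, but one training-free variant would greedily chain sentences by BERT's next-sentence-prediction probability, as sketched below; treat the use of the NSP head and the greedy search as assumptions rather than the paper's exact method.

```python
# Hedged sketch: training-free sentence ordering by greedily chaining the pair with
# the highest next-sentence score from BERT's NSP head.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased").eval()

def nsp_score(a, b):
    """Probability that sentence b follows sentence a."""
    enc = tok(a, b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return logits.softmax(-1)[0, 0].item()   # index 0 = "b continues a"

sentences = ["She poured the coffee.", "Anna woke up late.", "Then she ran to work."]
order = [0]                                   # simplification: start from the first sentence
remaining = set(range(1, len(sentences)))
while remaining:
    best = max(remaining, key=lambda j: nsp_score(sentences[order[-1]], sentences[j]))
    order.append(best)
    remaining.remove(best)
print([sentences[i] for i in order])
```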
arXiv Detail & Related papers (2021-08-26T18:47:15Z)
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing [78.8500633981247]
This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning".
Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly.
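As a minimal illustration of prompting (not an example taken from the survey), a cloze template can turn sentiment classification into filling a mask and comparing the scores of a few verbalizer words; the template, verbalizers, and model are arbitrary illustrative choices.

```python
# Minimal prompt-based classification sketch: a cloze template turns sentiment
# classification into filling a mask, scoring only the verbalizer words.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def classify(review):
    prompt = f"{review} Overall, it was {tok.mask_token}."     # illustrative template
    enc = tok(prompt, return_tensors="pt")
    pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**enc).logits[0, pos]
    verbalizers = {"positive": "great", "negative": "terrible"}  # illustrative label words
    scores = {label: logits[tok.convert_tokens_to_ids(word)].item()
              for label, word in verbalizers.items()}
    return max(scores, key=scores.get)

print(classify("The plot was dull and the acting was worse."))
```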
arXiv Detail & Related papers (2021-07-28T18:09:46Z)
- Few-shot Learning for Topic Modeling [39.56814839510978]
We propose a neural network-based few-shot learning method that can learn a topic model from just a few documents.
We demonstrate that the proposed method achieves better perplexity than existing methods using three real-world text document sets.
arXiv Detail & Related papers (2021-04-19T01:56:48Z)
- Toward Better Storylines with Sentence-Level Language Models [54.91921545103256]
We propose a sentence-level language model which selects the next sentence in a story from a finite set of fluent alternatives.
We demonstrate the effectiveness of our approach with state-of-the-art accuracy on the unsupervised Story Cloze task.
arXiv Detail & Related papers (2020-05-11T16:54:19Z)
- Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! [5.819224524813161]
We propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words.
The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.
arXiv Detail & Related papers (2020-04-30T16:18:18Z)
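A minimal sketch of that clustering-based alternative: k-means over pre-trained word vectors, with each cluster's top words reranked by corpus frequency. The GloVe model, toy corpus, stopword list, and frequency-based reranking are simplifications, not the paper's exact weighting scheme.

```python
# Minimal sketch of topics via clustering pre-trained word embeddings: k-means over
# word vectors, then reranking each cluster's words by corpus frequency.
from collections import Counter
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

docs = ["the match ended with a late goal",
        "the striker scored and the team won",
        "the election results surprised the senate",
        "voters backed the new policy in parliament"]
stop = {"the", "a", "and", "with", "in"}                      # toy stopword list
counts = Counter(w for d in docs for w in d.split() if w not in stop)
vectors = api.load("glove-wiki-gigaword-50")                  # pre-trained word embeddings
vocab = [w for w in counts if w in vectors]
X = np.stack([vectors[w] for w in vocab])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for k in range(2):
    cluster = [w for w, l in zip(vocab, labels) if l == k]
    top = sorted(cluster, key=lambda w: counts[w], reverse=True)[:5]  # rerank by frequency
    print(f"topic {k}:", top)
```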
This list is automatically generated from the titles and abstracts of the papers on this site.