Improving Reliability of Latent Dirichlet Allocation by Assessing Its
Stability Using Clustering Techniques on Replicated Runs
- URL: http://arxiv.org/abs/2003.04980v1
- Date: Fri, 14 Feb 2020 07:10:18 GMT
- Title: Improving Reliability of Latent Dirichlet Allocation by Assessing Its
Stability Using Clustering Techniques on Replicated Runs
- Authors: Jonas Rieger, Lars Koppers, Carsten Jentsch, and Jörg Rahnenführer
- Abstract summary: We study the stability of LDA by comparing assignments from replicated runs.
We propose to quantify the similarity of two generated topics by a modified Jaccard coefficient.
We show that the measure S-CLOP is useful for assessing the stability of LDA models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic modeling provides useful tools for organizing large text
corpora. A widely used method is Latent Dirichlet Allocation (LDA), a
generative probabilistic model that represents the individual texts in a
collection as mixtures of latent topics. The assignments of words to topics
depend on initial values, so in general the outcome of LDA is not fully
reproducible. In addition, the reassignment via Gibbs sampling is based on
conditional distributions, leading to different results in replicated runs on
the same text data. This fact is often neglected in everyday practice. We aim
to improve the reliability of LDA results. To this end, we study the stability
of LDA by comparing assignments from replicated runs. We propose to quantify
the similarity of two generated topics by a modified Jaccard coefficient.
Using such similarities, topics can be clustered. We further propose a new
pruning algorithm for hierarchical clustering results, based on the idea that
two LDA runs create pairs of similar topics. This approach leads to the new
measure S-CLOP (Similarity of multiple sets by Clustering with LOcal Pruning)
for quantifying the stability of LDA models. We discuss some characteristics
of this measure and illustrate it with an application to real data consisting
of newspaper articles from USA Today. Our results show that S-CLOP is useful
for assessing the stability of LDA models, or of any other topic modeling
procedure that characterizes its topics by word distributions. Based on the
newly proposed stability measure, we propose a method to increase the
reliability, and hence the reproducibility, of empirical findings based on
topic modeling. This increase in reliability is obtained by running LDA
several times and taking the most representative run as prototype, that is,
the run with the highest average similarity to all other runs.
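The similarity comparison and the prototype-selection step described in the abstract can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: topics are represented here simply as sets of their top words (the paper's modified Jaccard coefficient additionally weights words by their assignment counts), and `runs` is a small hypothetical list of replicated LDA runs.

```python
def topic_similarity(topic_a, topic_b):
    """Plain Jaccard coefficient between two topics, each given as the
    set of its most probable words (a simplification of the paper's
    count-weighted modification)."""
    if not topic_a and not topic_b:
        return 0.0
    return len(topic_a & topic_b) / len(topic_a | topic_b)

def run_similarity(run_a, run_b):
    """Average best-match similarity between the topics of two runs:
    each topic of run_a is matched to its most similar topic in run_b."""
    return sum(
        max(topic_similarity(t, u) for u in run_b) for t in run_a
    ) / len(run_a)

def prototype_run(runs):
    """Index of the run with the highest average similarity to all
    other runs, i.e. the most representative run."""
    def avg_sim(i):
        return sum(run_similarity(runs[i], runs[j])
                   for j in range(len(runs)) if j != i) / (len(runs) - 1)
    return max(range(len(runs)), key=avg_sim)

# Toy example: three replicated "runs", each with two topics.
# The third run drifted away from the other two.
runs = [
    [{"court", "law", "judge"}, {"game", "team", "score"}],
    [{"court", "law", "jury"},  {"game", "team", "coach"}],
    [{"tax", "budget", "state"}, {"movie", "film", "actor"}],
]
best = prototype_run(runs)
```

In this toy setup the first two runs agree with each other and disagree with the third, so one of them is selected as the prototype; with real LDA output the same logic would operate on the word distributions of fitted models.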
Related papers
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- Dataset Condensation with Latent Quantile Matching [5.466962214217334]
Current distribution matching (DM) based dataset condensation (DC) methods learn a synthesized dataset by matching the means of the latent embeddings of the synthetic and the real datasets.
We propose Latent Quantile Matching (LQM) which matches the quantiles of the latent embeddings to minimize the goodness of fit test statistic between two distributions.
arXiv Detail & Related papers (2024-06-14T09:20:44Z)
- Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
- Sample Complexity Characterization for Linear Contextual MDPs [67.79455646673762]
Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time, with different MDPs indexed by a context variable.
CMDPs serve as an important framework to model many real-world applications with time-varying environments.
We study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights.
arXiv Detail & Related papers (2024-02-05T03:25:04Z)
- Language as a Latent Sequence: deep latent variable models for semi-supervised paraphrase generation [47.33223015862104]
We present a novel unsupervised model named variational sequence auto-encoding reconstruction (VSAR), which performs latent sequence inference given an observed text.
To leverage information from text pairs, we additionally introduce a novel supervised model we call dual directional learning (DDL), which is designed to integrate with our proposed VSAR model.
Our empirical evaluations suggest that the combined model yields competitive performance against the state-of-the-art supervised baselines on complete data.
arXiv Detail & Related papers (2023-01-05T19:35:30Z)
- Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [51.33182775762785]
This paper presents an empirical study to build relation extraction systems in low-resource settings.
We investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; and (iii) data augmentation technologies and self-training to generate more labeled in-domain data.
arXiv Detail & Related papers (2022-10-19T15:46:37Z)
- Making a (Counterfactual) Difference One Rationale at a Time [5.97507595130844]
We investigate whether counterfactual data augmentation (CDA), without human assistance, can improve the performance of the selector.
Our results show that CDA produces rationales that better capture the signal of interest.
arXiv Detail & Related papers (2022-01-13T19:05:02Z)
- ALBU: An approximate Loopy Belief message passing algorithm for LDA to improve performance on small data sets [3.5027291542274366]
We present a novel variational message passing algorithm as applied to Latent Dirichlet Allocation (LDA).
We compare it with the gold-standard variational Bayes (VB) and collapsed Gibbs sampling algorithms.
Using coherence measures for the text corpora and the Kullback-Leibler divergence (KLD) for the simulations, we show that ALBU learns latent distributions more accurately than VB does.
arXiv Detail & Related papers (2021-10-01T19:55:12Z)
- SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval [11.38022203865326]
The SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches.
We modify the pooling mechanism, benchmark a model based solely on document expansion, and introduce models trained with distillation.
Overall, SPLADE is considerably improved, with gains of more than 9% on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
arXiv Detail & Related papers (2021-09-21T10:43:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.