Improving Reliability of Latent Dirichlet Allocation by Assessing Its
Stability Using Clustering Techniques on Replicated Runs
- URL: http://arxiv.org/abs/2003.04980v1
- Date: Fri, 14 Feb 2020 07:10:18 GMT
- Title: Improving Reliability of Latent Dirichlet Allocation by Assessing Its
Stability Using Clustering Techniques on Replicated Runs
- Authors: Jonas Rieger, Lars Koppers, Carsten Jentsch, and Jörg Rahnenführer
- Abstract summary: We study the stability of LDA by comparing assignments from replicated runs.
We propose to quantify the similarity of two generated topics by a modified Jaccard coefficient.
We show that the measure S-CLOP is useful for assessing the stability of LDA models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic modeling provides useful tools for organizing large text
corpora. A widely used method is Latent Dirichlet Allocation (LDA), a
generative probabilistic model that represents the individual texts in a
collection as mixtures of latent topics. The assignments of words to topics
depend on initial values, so in general the outcome of LDA is not fully
reproducible. In addition, the reassignment via Gibbs sampling is based on
conditional distributions, leading to different results in replicated runs on
the same text data. This fact is often neglected in everyday practice. We aim
to improve the reliability of LDA results. To this end, we study the stability
of LDA by comparing assignments from replicated runs. We propose to quantify
the similarity of two generated topics by a modified Jaccard coefficient.
Using such similarities, topics can be clustered. We further propose a new
pruning algorithm for hierarchical clustering results, based on the idea that
two LDA runs create pairs of similar topics. This approach leads to the new
measure S-CLOP (Similarity of multiple sets by Clustering with LOcal Pruning)
for quantifying the stability of LDA models. We discuss some characteristics
of this measure and illustrate it with an application to real data consisting
of newspaper articles from USA Today. Our results show that S-CLOP is useful
for assessing the stability of LDA models, or of any other topic modeling
procedure that characterizes its topics by word distributions. Based on the
newly proposed stability measure, we propose a method to increase the
reliability, and hence the reproducibility, of empirical findings based on
topic modeling. This increase in reliability is obtained by running LDA
several times and taking the most representative run as prototype, that is,
the run with the highest average similarity to all other runs.
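The similarity comparison and the prototype-selection step described in the abstract can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: topics are represented here simply as sets of their top words (the paper's modified Jaccard coefficient additionally weights words by their assignment counts), and `runs` is a small hypothetical list of replicated LDA runs.

```python
def topic_similarity(topic_a, topic_b):
    """Plain Jaccard coefficient between two topics, each given as the
    set of its most probable words (a simplification of the paper's
    count-weighted modification)."""
    if not topic_a and not topic_b:
        return 0.0
    return len(topic_a & topic_b) / len(topic_a | topic_b)

def run_similarity(run_a, run_b):
    """Average best-match similarity between the topics of two runs:
    each topic of run_a is matched to its most similar topic in run_b."""
    return sum(
        max(topic_similarity(t, u) for u in run_b) for t in run_a
    ) / len(run_a)

def prototype_run(runs):
    """Index of the run with the highest average similarity to all
    other runs, i.e. the most representative run."""
    def avg_sim(i):
        return sum(run_similarity(runs[i], runs[j])
                   for j in range(len(runs)) if j != i) / (len(runs) - 1)
    return max(range(len(runs)), key=avg_sim)

# Toy example: three replicated "runs", each with two topics.
# The third run drifted away from the other two.
runs = [
    [{"court", "law", "judge"}, {"game", "team", "score"}],
    [{"court", "law", "jury"},  {"game", "team", "coach"}],
    [{"tax", "budget", "state"}, {"movie", "film", "actor"}],
]
best = prototype_run(runs)
```

In this toy setup the first two runs agree with each other and disagree with the third, so one of them is selected as the prototype; with real LDA output the same logic would operate on the word distributions of fitted models.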
Related papers
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- Dataset Condensation with Latent Quantile Matching [5.466962214217334]
Current distribution matching (DM) based dataset condensation (DC) methods learn a synthesized dataset by matching the means of the latent embeddings of the synthetic and the real datasets.
We propose Latent Quantile Matching (LQM) which matches the quantiles of the latent embeddings to minimize the goodness of fit test statistic between two distributions.
arXiv Detail & Related papers (2024-06-14T09:20:44Z)
- Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
- Sample Complexity Characterization for Linear Contextual MDPs [67.79455646673762]
Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time, with different MDPs indexed by a context variable.
CMDPs serve as an important framework to model many real-world applications with time-varying environments.
We study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights.
arXiv Detail & Related papers (2024-02-05T03:25:04Z)
- Language as a Latent Sequence: deep latent variable models for semi-supervised paraphrase generation [47.33223015862104]
We present a novel unsupervised model named variational sequence auto-encoding reconstruction (VSAR), which performs latent sequence inference given an observed text.
To leverage information from text pairs, we additionally introduce a novel supervised model we call dual directional learning (DDL), which is designed to integrate with our proposed VSAR model.
Our empirical evaluations suggest that the combined model yields competitive performance against the state-of-the-art supervised baselines on complete data.
arXiv Detail & Related papers (2023-01-05T19:35:30Z)
- Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [51.33182775762785]
This paper presents an empirical study to build relation extraction systems in low-resource settings.
We investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; and (iii) data augmentation technologies and self-training to generate more labeled in-domain data.
arXiv Detail & Related papers (2022-10-19T15:46:37Z)
- Making a (Counterfactual) Difference One Rationale at a Time [5.97507595130844]
We investigate whether counterfactual data augmentation (CDA), without human assistance, can improve the performance of the selector.
Our results show that CDA produces rationales that better capture the signal of interest.
arXiv Detail & Related papers (2022-01-13T19:05:02Z)
- ALBU: An approximate Loopy Belief message passing algorithm for LDA to improve performance on small data sets [3.5027291542274366]
We present a novel variational message passing algorithm as applied to Latent Dirichlet Allocation (LDA).
We compare it with the gold-standard variational Bayes (VB) and collapsed Gibbs sampling algorithms.
Using coherence measures for the text corpora and the Kullback-Leibler divergence (KLD) for the simulations, we show that ALBU learns latent distributions more accurately than VB does.
arXiv Detail & Related papers (2021-10-01T19:55:12Z)
- SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval [11.38022203865326]
The SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches.
We modify the pooling mechanism, benchmark a model based solely on document expansion, and introduce models trained with distillation.
Overall, SPLADE is considerably improved, with gains of more than 9% on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
arXiv Detail & Related papers (2021-09-21T10:43:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.