Learning Topic Models: Identifiability and Finite-Sample Analysis
- URL: http://arxiv.org/abs/2110.04232v1
- Date: Fri, 8 Oct 2021 16:35:42 GMT
- Title: Learning Topic Models: Identifiability and Finite-Sample Analysis
- Authors: Yinyin Chen, Shishuang He, Yun Yang and Feng Liang
- Abstract summary: We propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood.
We conclude with empirical studies on both simulated and real datasets.
- Score: 6.181048261489101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Topic models provide a useful text-mining tool for learning, extracting and
discovering latent structures in large text corpora. Although a plethora of
methods have been proposed for topic modeling, a formal theoretical
investigation on the statistical identifiability and accuracy of latent topic
estimation is lacking in the literature. In this paper, we propose a maximum
likelihood estimator (MLE) of latent topics based on a specific integrated
likelihood, which is naturally connected to the concept of volume minimization
in computational geometry. Theoretically, we introduce a new set of geometric
conditions for topic model identifiability, which are weaker than conventional
separability conditions relying on the existence of anchor words or pure topic
documents. We conduct finite-sample error analysis for the proposed estimator
and discuss the connection of our results with existing ones. We conclude with
empirical studies on both simulated and real datasets.
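The connection to volume minimization sketched in the abstract rests on a geometric picture: each topic is a point (a word distribution) in the vocabulary simplex, and every document's expected word distribution is a convex combination of the topic vectors, so documents lie inside the polytope spanned by the topics. The following small simulation illustrates that picture; the variable names, dimensions, and Dirichlet parameters are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, D = 6, 3, 200  # vocabulary size, number of topics, number of documents

# Topic-word matrix: K rows, each a probability distribution over the V words.
topics = rng.dirichlet(np.ones(V) * 0.3, size=K)   # shape (K, V)

# Document-topic weights: each document is a mixture over the K topics.
weights = rng.dirichlet(np.ones(K) * 0.5, size=D)  # shape (D, K)

# Expected word distribution of each document: a convex combination of
# the topic vectors, so every document lies inside the simplex whose
# vertices are the K topic rows.
doc_word = weights @ topics                        # shape (D, V)

# Each row is itself a probability vector over the vocabulary.
assert np.allclose(doc_word.sum(axis=1), 1.0)
```

Under this picture, estimating the topics amounts to finding a simplex that contains all the document points; volume minimization selects the smallest such simplex, which is why weaker geometric conditions than anchor words can suffice for identifiability.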
Related papers
- Reliability of Topic Modeling [0.3759936323189418]
We show that the standard practice for quantifying topic model reliability fails to capture essential aspects of the variation in two widely-used topic models.
On synthetic and real-world data, we show that McDonald's $\omega$ provides the best encapsulation of reliability.
arXiv Detail & Related papers (2024-10-30T16:42:04Z) - The Foundations of Tokenization: Statistical and Computational Concerns [51.370165245628975]
Tokenization is a critical step in the NLP pipeline.
Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood.
The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models.
arXiv Detail & Related papers (2024-07-16T11:12:28Z) - Interactive Topic Models with Optimal Transport [75.26555710661908]
We present EdTM, an approach for label-name supervised topic modeling.
EdTM casts topic modeling as an assignment problem while leveraging LM/LLM-based document-topic affinities.
arXiv Detail & Related papers (2024-06-28T13:57:27Z) - The Geometric Structure of Topic Models [0.0]
Despite their widespread use in research and application, an in-depth analysis of topic models is still an open research topic.
We propose an incidence-geometric method for deriving an ordinal structure from flat topic models.
We present a new visualization paradigm for concept hierarchies based on ordinal motifs.
arXiv Detail & Related papers (2024-03-06T10:53:51Z) - How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z) - Discovering Interpretable Physical Models using Symbolic Regression and
Discrete Exterior Calculus [55.2480439325792]
We propose a framework that combines Symbolic Regression (SR) and Discrete Exterior Calculus (DEC) for the automated discovery of physical models.
DEC provides building blocks for the discrete analogue of field theories, which are beyond the state-of-the-art applications of SR to physical problems.
We prove the effectiveness of our methodology by re-discovering three models of Continuum Physics from synthetic experimental data.
arXiv Detail & Related papers (2023-10-10T13:23:05Z) - Topics in the Haystack: Extracting and Evaluating Topics beyond
Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z) - Model-agnostic multi-objective approach for the evolutionary discovery
of mathematical models [55.41644538483948]
In modern data science, it is often more important to understand the properties of a model and identify which parts could be replaced to obtain better results.
We use multi-objective evolutionary optimization for composite data-driven model learning to obtain the algorithm's desired properties.
arXiv Detail & Related papers (2021-07-07T11:17:09Z) - Amortized Bayesian model comparison with evidential deep learning [0.12314765641075436]
We propose a novel method for performing Bayesian model comparison using specialized deep learning architectures.
Our method is purely simulation-based and circumvents the step of explicitly fitting all alternative models under consideration to each observed dataset.
We show that our method achieves excellent results in terms of accuracy, calibration, and efficiency across the examples considered in this work.
arXiv Detail & Related papers (2020-04-22T15:15:46Z) - How Far are We from Effective Context Modeling? An Exploratory Study on
Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding framework for semantic parsing and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performances.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.