ProSiT! Latent Variable Discovery with PROgressive SImilarity Thresholds
- URL: http://arxiv.org/abs/2210.14763v1
- Date: Wed, 26 Oct 2022 14:52:44 GMT
- Title: ProSiT! Latent Variable Discovery with PROgressive SImilarity Thresholds
- Authors: Tommaso Fornaciari, Dirk Hovy, Federico Bianchi
- Abstract summary: ProSiT is a deterministic and interpretable method that finds the optimal number of latent dimensions.
In most settings, ProSiT matches or outperforms the other methods in terms of topic coherence and distinctiveness.
- Score: 35.09631990817093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The most common ways to explore latent document dimensions are topic models
and clustering methods. However, topic models have several drawbacks: e.g.,
they require us to choose the number of latent dimensions a priori, and the
results are stochastic. Most clustering methods have the same issues and lack
flexibility in various ways, such as not accounting for the influence of
different topics on single documents, forcing word-descriptors to belong to a
single topic (hard-clustering) or necessarily relying on word representations.
We propose PROgressive SImilarity Thresholds - ProSiT, a deterministic and
interpretable method, agnostic to the input format, that finds the optimal
number of latent dimensions and only has two hyper-parameters, which can be set
efficiently via grid search. We compare this method with a wide range of topic
models and clustering methods on four benchmark data sets. In most settings,
ProSiT matches or outperforms the other methods on six metrics of topic
coherence and distinctiveness, producing replicable, deterministic results.
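As an illustration of the general idea (not the authors' exact algorithm), the sketch below agglomerates document vectors under a progressively relaxed cosine-similarity threshold, so the number of latent dimensions emerges from the data; the merge rule, threshold schedule, and the parameter names start/step/floor are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def progressive_threshold_merge(doc_vecs, start=0.95, step=0.05, floor=0.5):
    """Merge the most similar pair of centroids whenever their cosine
    similarity clears the current threshold; relax the threshold when
    no pair qualifies. The number of surviving centroids (latent
    dimensions) is data-driven rather than fixed a priori."""
    centroids = [np.asarray(v, dtype=float) for v in doc_vecs]
    threshold = start
    while threshold >= floor and len(centroids) > 1:
        sims = cosine_similarity(np.vstack(centroids))
        np.fill_diagonal(sims, -1.0)              # ignore self-similarity
        i, j = np.unravel_index(sims.argmax(), sims.shape)
        if sims[i, j] >= threshold:
            merged = (centroids[i] + centroids[j]) / 2.0
            centroids = [c for k, c in enumerate(centroids) if k not in (i, j)]
            centroids.append(merged)
        else:
            threshold -= step                     # progressively relax
    return np.vstack(centroids)                   # one row per latent dimension

# Example usage with synthetic embeddings:
docs = np.random.default_rng(0).normal(size=(60, 16))
print(progressive_threshold_merge(docs).shape)    # (k, 16) for a data-driven k
```

In the paper, two hyper-parameters control the procedure and are tuned by grid search; the three knobs above are illustrative stand-ins. Every step is deterministic, which matches the replicability claim.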
Related papers
- Explaining Datasets in Words: Statistical Models with Natural Language Parameters [66.69456696878842]
We introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates.
We apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, and finding categories where one language model is better than another.
arXiv Detail & Related papers (2024-09-13T01:40:20Z)
- Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation [60.493180081319785]
We propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step.
Our work provides a comprehensive comparison between existing truncation sampling methods, as well as their recommended parameters as a guideline for users.
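For context on what a truncation sampling method looks like, here is a minimal sketch of nucleus (top-p) sampling, one member of the family such comparisons cover; the parameter p is the knob that trades diversity against the risk of sampling degenerate tail tokens.

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus (top-p) truncation: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, and sample. Larger p
    keeps more of the tail: more diversity, but higher risk."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    kept = order[:cutoff]                        # the "nucleus" of tokens
    return rng.choice(kept, p=probs[kept] / probs[kept].sum())

token = top_p_sample(np.array([2.0, 1.0, 0.5, -1.0]), p=0.9)
```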
arXiv Detail & Related papers (2024-08-24T14:14:32Z)
- Self Supervised Correlation-based Permutations for Multi-View Clustering [7.972599673048582]
We propose an end-to-end deep learning-based MVC framework for general data.
Our approach involves learning meaningful fused data representations with a novel permutation-based canonical correlation objective.
We demonstrate the effectiveness of our model using ten MVC benchmark datasets.
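The permutation-based canonical correlation objective does not fit in a short sketch, but a naive multi-view baseline clarifies the setting: several feature matrices (views) over the same samples must yield one clustering. The sketch below merely standardizes and concatenates the views before k-means; the cited framework instead learns the fused representation end-to-end.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def naive_multiview_kmeans(views, n_clusters):
    """Baseline fusion: z-score each view, concatenate the features of
    all views, then run k-means once on the fused matrix."""
    fused = np.hstack([StandardScaler().fit_transform(v) for v in views])
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(fused)
```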
arXiv Detail & Related papers (2024-02-26T08:08:30Z)
- High-dimensional variable clustering based on maxima of a weakly dependent random process [1.1999555634662633]
We propose a new class of models for variable clustering called Asymptotic Independent block (AI-block) models.
This class of models is identifiable, meaning that there exists a maximal element with a partial order between partitions, allowing for statistical inference.
We also present an algorithm depending on a tuning parameter that recovers the clusters of variables without specifying the number of clusters a priori.
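A rough sketch of the tuning-parameter idea: cluster variables by cutting a hierarchical tree at a distance threshold tau instead of fixing the number of clusters. The componentwise block maxima and correlation-based dependence measure below are simple stand-ins, not the paper's extremal-dependence estimator.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_variables_by_threshold(X, block=50, tau=0.7):
    """Compute componentwise block maxima, measure pairwise dependence
    via correlation of the maxima (a stand-in dependence measure), and
    cut an average-linkage tree at distance tau; smaller tau yields
    more clusters, and no cluster count is specified a priori."""
    n, d = X.shape
    maxima = X[: (n // block) * block].reshape(-1, block, d).max(axis=1)
    dist = 1.0 - np.abs(np.corrcoef(maxima, rowvar=False))
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=tau, criterion="distance")  # one label per variable
```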
arXiv Detail & Related papers (2023-02-02T08:24:26Z)
- A parallelizable model-based approach for marginal and multivariate clustering [0.0]
This paper develops a clustering method that takes advantage of the sturdiness of model-based clustering.
We tackle this issue by specifying a finite mixture model per margin that allows each margin to have a different number of clusters.
The proposed approach is computationally appealing as well as more tractable for moderate to high dimensions than a 'full' (joint) model-based clustering approach.
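A minimal sketch of the per-margin idea using scikit-learn: each margin (column) gets its own univariate Gaussian mixture, with its number of components selected by BIC, so margins can have different numbers of clusters. The Gaussian family and BIC criterion are assumptions for illustration; note the loop is embarrassingly parallel across margins, which is what makes the approach parallelizable.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def marginal_mixtures(X, max_components=5):
    """Fit an independent univariate Gaussian mixture to each margin
    (column), choosing that margin's number of components by BIC, so
    different margins may have different numbers of clusters."""
    fits = []
    for j in range(X.shape[1]):                  # embarrassingly parallel loop
        col = X[:, [j]]
        candidates = [GaussianMixture(n_components=k, random_state=0).fit(col)
                      for k in range(1, max_components + 1)]
        fits.append(min(candidates, key=lambda m: m.bic(col)))
    return fits                                  # one fitted mixture per margin
```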
arXiv Detail & Related papers (2022-12-07T23:54:41Z)
- A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
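A minimal sketch of that one-shot pattern, assuming users send local parameter vectors and the server uses k-means for the clustering-based aggregation step; the specific clustering and averaging choices here are illustrative, not the paper's method.

```python
import numpy as np
from sklearn.cluster import KMeans

def one_shot_clustered_aggregation(local_models, n_clusters):
    """Single communication round: users send local parameter vectors;
    the server clusters them and returns one averaged model per cluster,
    so users in the same cluster share a model."""
    W = np.vstack(local_models)                  # one row per user
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(W).labels_
    models = np.vstack([W[labels == c].mean(axis=0) for c in range(n_clusters)])
    return labels, models
```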
arXiv Detail & Related papers (2022-09-22T09:04:10Z)
- Personalized Federated Learning via Convex Clustering [72.15857783681658]
We propose a family of algorithms for personalized federated learning with locally convex user costs.
The proposed framework is based on a generalization of convex clustering in which the differences between different users' models are penalized.
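To make the penalty concrete, here is a sketch of an objective of that flavor, with quadratic local costs standing in for the users' actual losses: each user keeps its own model, and a convex-clustering-style fusion term penalizes pairwise model differences, pulling similar users toward shared models. The losses, solver, and step size here are assumptions, not the paper's algorithm.

```python
import numpy as np

def objective(W, targets, lam):
    """sum_i ||w_i - t_i||^2 (assumed quadratic local costs) plus a
    convex-clustering fusion term lam * sum_{i<j} ||w_i - w_j||."""
    fusion = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
    return ((W - targets) ** 2).sum() + lam * np.triu(fusion, k=1).sum()

def subgradient_step(W, targets, lam, lr=0.01):
    """One subgradient step: the fusion term pulls similar users' models
    together; large lam collapses users into shared cluster models."""
    diffs = W[:, None, :] - W[None, :, :]
    norms = np.linalg.norm(diffs, axis=2, keepdims=True)
    norms[norms == 0] = 1.0                      # guard the zero-difference entries
    grad = 2.0 * (W - targets) + lam * (diffs / norms).sum(axis=1)
    return W - lr * grad

W, targets = np.zeros((4, 3)), np.arange(12.0).reshape(4, 3)
for _ in range(200):
    W = subgradient_step(W, targets, lam=0.1)
```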
arXiv Detail & Related papers (2022-02-01T19:25:31Z)
- Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score [0.5330240017302619]
We propose a selection rule that allows choosing among many clustering solutions.
The proposed method has the distinctive advantage that it can compare partitions that cannot be compared with other state-of-the-art methods.
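As a rough illustration of a quadratic-discriminant-based score (not the paper's exact rule): fit one Gaussian per cluster via QDA and measure how well the induced quadratic rule re-assigns points to their own clusters; such a score can be computed for partitions produced by any model or algorithm, which is what enables comparisons across them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def quadratic_discriminant_score(X, labels):
    """Fit one Gaussian per cluster (QDA, lightly regularized) and score
    how well the quadratic rule recovers the partition; assumes the
    partition has at least two clusters."""
    qda = QuadraticDiscriminantAnalysis(reg_param=1e-3).fit(X, labels)
    return qda.score(X, labels)                  # in-sample re-assignment accuracy

X = np.random.default_rng(0).normal(size=(200, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(quadratic_discriminant_score(X, labels))
```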
arXiv Detail & Related papers (2021-11-03T15:38:58Z)
- Multilayer Networks for Text Analysis with Multiple Data Types [0.21108097398435335]
We propose a novel framework based on Multilayer Networks and Block Models.
We show that taking into account multiple types of information provides a more nuanced view on topic- and document-clusters.
arXiv Detail & Related papers (2021-06-30T05:47:39Z)
- Conjoined Dirichlet Process [63.89763375457853]
We develop a novel, non-parametric probabilistic biclustering method based on Dirichlet processes to identify biclusters with strong co-occurrence in both rows and columns.
We apply our method to two different applications, text mining and gene expression analysis, and demonstrate that our method improves bicluster extraction in many settings compared to existing approaches.
arXiv Detail & Related papers (2020-02-08T19:41:23Z)