Large-Scale Evaluation of Topic Models and Dimensionality Reduction
Methods for 2D Text Spatialization
- URL: http://arxiv.org/abs/2307.11770v1
- Date: Mon, 17 Jul 2023 14:08:25 GMT
- Title: Large-Scale Evaluation of Topic Models and Dimensionality Reduction
Methods for 2D Text Spatialization
- Authors: Daniel Atzberger, Tim Cech, Willy Scheibel, Matthias Trapp, Rico Richter, Jürgen Döllner, Tobias Schreck
- Abstract summary: We show that interpretable topic models are beneficial for capturing the structure of text corpora.
We propose guidelines for the effective design of text spatializations based on topic models and dimensionality reductions.
- Score: 2.6034734004409303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic models are a class of unsupervised learning algorithms for detecting
the semantic structure within a text corpus. Together with a subsequent
dimensionality reduction algorithm, topic models can be used for deriving
spatializations for text corpora as two-dimensional scatter plots, reflecting
semantic similarity between the documents and supporting corpus analysis.
Although the choice of the topic model, the dimensionality reduction, and their
underlying hyperparameters significantly impact the resulting layout, it is
unknown which particular combinations result in high-quality layouts with
respect to accuracy and perception metrics. To investigate the effectiveness of
topic models and dimensionality reduction methods for the spatialization of
corpora as two-dimensional scatter plots (or as a basis for landscape-type
visualizations), we present a large-scale, benchmark-based computational
evaluation. Our evaluation consists of (1) a set of corpora, (2) a set of
layout algorithms that are combinations of topic models and dimensionality
reductions, and (3) quality metrics for quantifying the resulting layout. The
corpora are given as document-term matrices, and each document is assigned to a
thematic class. The chosen metrics quantify the preservation of local and
global properties and the perceptual effectiveness of the two-dimensional
scatter plots. By evaluating the benchmark on a computing cluster, we derived a
multivariate dataset with over 45,000 individual layouts and corresponding
quality metrics. Based on the results, we propose guidelines for the effective
design of text spatializations that are based on topic models and
dimensionality reductions. As a main result, we show that interpretable topic
models are beneficial for capturing the structure of text corpora. We
furthermore recommend the use of t-SNE as a subsequent dimensionality
reduction.
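The evaluated pipeline can be illustrated with a minimal sketch: a topic model turns the document-term matrix into per-document topic vectors, t-SNE computes the two-dimensional layout, and a quality metric scores the result. The corpus, hyperparameters, and the use of trustworthiness as the metric are illustrative assumptions, not the paper's actual benchmark configuration.

```python
# Minimal topic-model + dimensionality-reduction pipeline (illustrative only).
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE, trustworthiness

# Build a document-term matrix (the benchmark's corpora are given in this form).
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:1000]
dtm = CountVectorizer(max_features=5000, stop_words="english").fit_transform(docs)

# Step 1: topic model -- each document becomes a topic-probability vector.
doc_topics = LatentDirichletAllocation(n_components=20, random_state=0).fit_transform(dtm)

# Step 2: dimensionality reduction to a 2D scatter-plot layout.
layout = TSNE(n_components=2, random_state=0).fit_transform(doc_topics)

# Step 3: one example quality metric -- preservation of local neighborhoods.
print("trustworthiness:", trustworthiness(doc_topics, layout, n_neighbors=10))
```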
Related papers
- A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations [4.810926556822174]
The semantic similarity between documents of a text corpus can be visualized using map-like metaphors.
These scatterplot layouts result from a dimensionality reduction on the document-term matrix or a representation within a latent embedding.
We present a sensitivity study that analyzes the stability of these layouts with respect to changes in the text corpora (a layout-stability sketch follows this entry).
arXiv Detail & Related papers (2024-07-25T08:46:49Z)
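A minimal sketch of how such a stability analysis can be made concrete, assuming a Procrustes disparity between layouts computed before and after a corpus perturbation; the synthetic term counts and the LSA + t-SNE layout chain are stand-ins, not the paper's setup.

```python
# Illustrative layout-stability check via Procrustes disparity.
import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
dtm = rng.poisson(0.3, size=(500, 2000)).astype(float)  # stand-in document-term matrix

def layout(matrix, seed=0):
    """Latent representation (here: LSA) followed by a 2D t-SNE layout."""
    latent = TruncatedSVD(n_components=50, random_state=seed).fit_transform(matrix)
    return TSNE(n_components=2, random_state=seed).fit_transform(latent)

# Perturb the corpus (here: extra noise counts) and compare the two layouts.
perturbed = dtm + rng.poisson(0.05, size=dtm.shape)
_, _, disparity = procrustes(layout(dtm), layout(perturbed))
print("Procrustes disparity (0 = identical up to rotation/scale):", disparity)
```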
- A new visual quality metric for Evaluating the performance of multidimensional projections [1.6574413179773757]
We propose a new visual quality metric based on human perception.
We show that the proposed metric produces more precise results when analyzing the quality of multidimensional projections (MP) than other previously used metrics.
arXiv Detail & Related papers (2024-07-23T09:02:46Z)
- Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z)
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit dimensionality reduction and clustering under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem (a toy Gromov-Wasserstein illustration follows this entry).
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
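As a toy illustration of the Gromov-Wasserstein connection, the sketch below (using the POT library) couples the pairwise-distance structure of a dataset with that of a small set of low-dimensional positions: the soft assignments play the role of clustering, the positions that of a reduction. Sizes and data are illustrative, and this is not the paper's solver.

```python
# Toy Gromov-Wasserstein coupling between data geometry and a 2D target (POT).
import numpy as np
import ot  # Python Optimal Transport
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # high-dimensional points
Y = rng.normal(size=(10, 2))    # a few low-dimensional positions

C1 = squareform(pdist(X))       # pairwise distances in the input space
C2 = squareform(pdist(Y))       # pairwise distances in the reduced space
p, q = ot.unif(len(X)), ot.unif(len(Y))

# The coupling T softly assigns each point to a low-dimensional position.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun="square_loss")
print("assignment of first point:", T[0].argmax())
```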
- ShaRP: Shape-Regularized Multidimensional Projections [71.30697308446064]
We present a novel projection technique - ShaRP - that provides users explicit control over the visual signature of the created scatterplot.
ShaRP scales well with dimensionality and dataset size, and generically handles any quantitative dataset.
arXiv Detail & Related papers (2023-06-01T11:16:58Z)
- The Deep Latent Position Topic Model for Clustering and Representation of Networks with Textual Edges [2.6334900941196087]
Deep-LPTM is a model-based clustering strategy based on a variational graph auto-encoder approach.
The emails of the Enron company are analysed and visualisations of the results are presented.
arXiv Detail & Related papers (2023-04-14T07:01:57Z)
- Optimal Discriminant Analysis in High-Dimensional Latent Factor Models [1.4213973379473654]
In high-dimensional classification problems, a commonly used approach is to first project the high-dimensional features into a lower dimensional space.
We formulate a latent-variable model with a hidden low-dimensional structure to justify this two-step procedure.
We propose a computationally efficient classifier that takes certain principal components (PCs) of the observed features as projections (a generic PCA-then-classify sketch follows this entry).
arXiv Detail & Related papers (2022-10-23T21:45:53Z)
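The generic two-step procedure the paper formalizes (project onto leading principal components, then classify in that space) can be sketched as follows; the synthetic data and the choice of LDA as the downstream classifier are assumptions for illustration.

```python
# Illustrative PCA-then-classify pipeline for high-dimensional features.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# High-dimensional features with a hidden low-dimensional structure.
X, y = make_classification(n_samples=400, n_features=500, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: project onto leading PCs; step 2: a linear classifier in that space.
clf = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```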
- CCP: Correlated Clustering and Projection for Dimensionality Reduction [5.992724190105578]
Correlated Clustering and Projection offers a novel data-domain strategy that does not require solving a matrix decomposition.
CCP partitions high-dimensional features into correlated clusters and then projects correlated features in each cluster into a one-dimensional representation.
Proposed methods are validated with benchmark datasets associated with various machine learning algorithms (a simplified sketch follows this entry).
arXiv Detail & Related papers (2022-06-08T23:14:44Z)
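A simplified sketch of the CCP idea as summarized above: cluster features by correlation, then project each cluster to one dimension. Hierarchical clustering and per-cluster PCA are stand-ins here, not necessarily the authors' exact algorithm.

```python
# Illustrative 'correlated clustering and projection' in the feature domain.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))  # stand-in high-dimensional dataset

# Cluster features by correlation (distance = 1 - |corr|, condensed form).
corr_dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
condensed = corr_dist[np.triu_indices(X.shape[1], k=1)]
labels = fcluster(linkage(condensed, method="average"), t=8, criterion="maxclust")

# Project each correlated feature cluster to a single dimension.
reduced = np.column_stack([
    PCA(n_components=1).fit_transform(X[:, labels == k]).ravel()
    for k in np.unique(labels)
])
print("reduced shape:", reduced.shape)  # (n_samples, n_clusters)
```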
- Set Based Stochastic Subsampling [85.5331107565578]
We propose a set-based two-stage end-to-end neural subsampling model that is jointly optimized with an arbitrary downstream task network.
We show that it outperforms the relevant baselines under low subsampling rates on a variety of tasks including image classification, image reconstruction, function reconstruction and few-shot classification.
arXiv Detail & Related papers (2020-06-25T07:36:47Z)
- Two-Dimensional Semi-Nonnegative Matrix Factorization for Clustering [50.43424130281065]
We propose a new Semi-Nonnegative Matrix Factorization method for 2-dimensional (2D) data, named TS-NMF.
It overcomes a drawback of existing methods, which damage the spatial information of the data by converting 2D data to vectors in a preprocessing step.
arXiv Detail & Related papers (2020-05-19T05:54:14Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data (a toy ensemble sketch follows this entry).
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
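A toy version of the underlying ensemble (one linear discriminant per random projection, combined by majority vote); the paper's actual contribution, an analytic estimator of the misclassification probability, is not implemented here.

```python
# Illustrative ensemble of randomly projected linear discriminants.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.random_projection import GaussianRandomProjection

X, y = make_classification(n_samples=600, n_features=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

votes = []
for seed in range(15):  # one discriminant per random projection
    proj = GaussianRandomProjection(n_components=20, random_state=seed)
    Z_tr, Z_te = proj.fit_transform(X_tr), proj.transform(X_te)
    votes.append(LinearDiscriminantAnalysis().fit(Z_tr, y_tr).predict(Z_te))

# Majority vote across the ensemble (labels are 0/1 here).
pred = (np.mean(votes, axis=0) > 0.5).astype(int)
print("ensemble test accuracy:", (pred == y_te).mean())
```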
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.