Analysing the Impact of Removing Infrequent Words on Topic Quality in
LDA Models
- URL: http://arxiv.org/abs/2311.14505v1
- Date: Fri, 24 Nov 2023 14:20:12 GMT
- Title: Analysing the Impact of Removing Infrequent Words on Topic Quality in
LDA Models
- Authors: Victor Bystrov, Viktoriia Naboka-Krell, Anna Staszewska-Bystrova,
Peter Winker
- Abstract summary: The paper examines the effects of removing infrequent words on the quality of topics estimated using Latent Dirichlet Allocation.
The results indicate that pruning is beneficial and that the share of vocabulary which might be eliminated can be quite considerable.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An initial procedure in text-as-data applications is text preprocessing. One
of the typical steps, which can substantially facilitate computations, consists
in removing infrequent words believed to provide limited information about the
corpus. Despite the popularity of vocabulary pruning, few guidelines on how to
implement it are available in the literature. The aim of the paper is to fill
this gap by examining the effects of removing infrequent words on the quality
of topics estimated using Latent Dirichlet Allocation. The analysis is based on
Monte Carlo experiments taking into account different criteria for removing
infrequent terms and various evaluation metrics. The results indicate that pruning
is beneficial and that the share of vocabulary which might be eliminated can be
quite considerable.
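As a rough illustration of what vocabulary pruning can look like in practice, the sketch below drops words that appear in fewer than a given number of documents and reports the share of the vocabulary removed. The abstract does not list the specific removal criteria studied, so the document-frequency threshold used here is an assumption, and the names `prune_infrequent` and `min_doc_freq` are hypothetical rather than taken from the paper.

```python
from collections import Counter


def prune_infrequent(docs, min_doc_freq=2):
    """Drop every word that occurs in fewer than `min_doc_freq` documents.

    Assumption: document frequency is the pruning criterion. This is one
    common choice, not necessarily the criterion used in the paper.
    """
    # Document frequency: count each word at most once per document.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    keep = {word for word, df in doc_freq.items() if df >= min_doc_freq}
    # Keep the original word order inside each document.
    pruned = [[word for word in doc if word in keep] for doc in docs]
    # Share of distinct words removed from the vocabulary.
    removed_share = 1 - len(keep) / len(doc_freq) if doc_freq else 0.0
    return pruned, keep, removed_share


# Toy tokenized corpus, invented purely for illustration.
docs = [
    ["topic", "model", "corpus", "rare"],
    ["topic", "model", "word"],
    ["corpus", "topic", "unique"],
]
pruned_docs, vocabulary, removed_share = prune_infrequent(docs, min_doc_freq=2)
```

On this toy corpus, three of the six distinct words occur in only one document, so half of the vocabulary is removed while the pruned documents remain usable as input to a topic model.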
Related papers
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A there is a big drop of >25% in performance observed in redacted queries as compared to the original.
arXiv Detail & Related papers (2024-11-08T21:22:37Z)
- An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks [2.3624125155742064]
We propose a new word embedding based corpus consisting of more than 61 million words crawled from multiple web resources.
We design a preprocessing pipeline for the filtration of unwanted text from crawled data.
The cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms.
arXiv Detail & Related papers (2024-08-28T11:36:29Z)
- Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism [62.571419297164645]
This paper provides a systematic overview of prior works on the logical reasoning ability of large language models for analyzing categorical syllogisms.
We first investigate all the possible variations for the categorical syllogisms from a purely logical perspective.
We then examine the underlying configurations (i.e., mood and figure) tested by the existing datasets.
arXiv Detail & Related papers (2024-06-26T21:17:20Z)
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Constructing Vec-tionaries to Extract Message Features from Texts: A Case Study of Moral Appeals [5.336592570916432]
We present an approach to construct vec-tionary measurement tools that boost validated dictionaries with word embeddings.
A vec-tionary can produce additional metrics to capture the ambivalence of a message feature beyond its strength in texts.
arXiv Detail & Related papers (2023-12-10T20:37:29Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update in the past decade.
This paper first points out the problems with using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
- Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences [69.3939291118954]
This paper reports research on a set of comprehensive clustering and network analyses targeting sentence and sub-sentence embedding spaces.
Results show that one method generates the most clusterable embeddings.
In general, the embeddings of span sub-sentences have better clustering properties than the original sentences.
arXiv Detail & Related papers (2021-10-02T00:47:35Z)
- Disentangling Homophemes in Lip Reading using Perplexity Analysis [10.262299768603894]
This paper proposes a new application for the Generative Pre-Training transformer.
It serves as a language model to convert visual speech in the form of visemes, to language in the form of words and sentences.
The network uses the search for optimal perplexity to perform the viseme-to-word mapping.
arXiv Detail & Related papers (2020-11-28T12:12:17Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how it automatically evaluates its models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Analysis and Evaluation of Language Models for Word Sense Disambiguation [18.001457030065712]
Transformer-based language models have taken many fields in NLP by storm.
BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense.
BERT and its derivatives dominate most of the existing evaluation benchmarks.
arXiv Detail & Related papers (2020-08-26T15:07:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.