Analysing the Impact of Removing Infrequent Words on Topic Quality in LDA Models
- URL: http://arxiv.org/abs/2311.14505v1
- Date: Fri, 24 Nov 2023 14:20:12 GMT
- Title: Analysing the Impact of Removing Infrequent Words on Topic Quality in LDA Models
- Authors: Victor Bystrov, Viktoriia Naboka-Krell, Anna Staszewska-Bystrova, Peter Winker
- Abstract summary: The paper examines the effects of removing infrequent words on the quality of topics estimated using Latent Dirichlet Allocation.
The results indicate that pruning is beneficial and that the share of vocabulary which might be eliminated can be quite considerable.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An initial procedure in text-as-data applications is text preprocessing. One of the typical steps, which can substantially facilitate computations, consists in removing infrequent words believed to provide limited information about the corpus. Despite the popularity of vocabulary pruning, few guidelines on how to implement it are available in the literature. The aim of the paper is to fill this gap by examining the effects of removing infrequent words on the quality of topics estimated using Latent Dirichlet Allocation. The analysis is based on Monte Carlo experiments that take into account different criteria for removing infrequent terms and various evaluation metrics. The results indicate that pruning is beneficial and that the share of vocabulary that can be eliminated is quite considerable.
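As a concrete illustration of the preprocessing step studied in the paper, the sketch below prunes infrequent terms with a document-frequency threshold before estimating an LDA model. It is a minimal sketch assuming scikit-learn; the toy corpus and the min_df value are illustrative choices, not the paper's experimental setup.

```python
# Minimal sketch: document-frequency pruning before LDA (assumes scikit-learn).
# The toy corpus and the min_df threshold are illustrative, not the paper's setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "topic models summarise large text corpora",
    "infrequent words add little information to topic models",
    "pruning the vocabulary of a text corpus can speed up estimation",
    "latent dirichlet allocation estimates topics from word counts in a corpus",
]

# min_df=2 drops every term that appears in fewer than two documents;
# the paper asks how aggressive such pruning can be before topic quality suffers.
vectorizer = CountVectorizer(min_df=2)
counts = vectorizer.fit_transform(docs)
print("retained vocabulary:", vectorizer.get_feature_names_out())

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print("document-topic shares:\n", doc_topics.round(2))
```

In the paper's Monte Carlo experiments, different criteria and thresholds for removing infrequent terms are compared and the resulting topics are scored with several evaluation metrics; a document-frequency cut-off as above is one common way to operationalise such pruning.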
Related papers
- An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks [2.3624125155742064]
We propose a new word embedding-based corpus consisting of more than 61 million words crawled from multiple web resources.
We design a preprocessing pipeline for the filtration of unwanted text from crawled data.
The cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms.
arXiv Detail & Related papers (2024-08-28T11:36:29Z)
- Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism [62.571419297164645]
This paper provides a systematic overview of prior works on the logical reasoning ability of large language models for analyzing categorical syllogisms.
We first investigate all the possible variations for the categorical syllogisms from a purely logical perspective.
We then examine the underlying configurations (i.e., mood and figure) tested by the existing datasets.
arXiv Detail & Related papers (2024-06-26T21:17:20Z)
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Constructing Vec-tionaries to Extract Message Features from Texts: A Case Study of Moral Appeals [5.336592570916432]
We present an approach to construct vec-tionary measurement tools that boost validated dictionaries with word embeddings.
A vec-tionary can produce additional metrics to capture the ambivalence of a message feature beyond its strength in texts.
arXiv Detail & Related papers (2023-12-10T20:37:29Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, with no significant update in the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
- Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences [69.3939291118954]
This paper reports research on a set of comprehensive clustering and network analyses targeting sentence and sub-sentence embedding spaces.
Results show that one method generates the most clusterable embeddings.
In general, the embeddings of span sub-sentences have better clustering properties than the original sentences.
arXiv Detail & Related papers (2021-10-02T00:47:35Z)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis [20.026835809227283]
We introduce "typicality", a new formulation of evaluation rooted in information theory.
We show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences.
Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
arXiv Detail & Related papers (2021-06-02T19:58:20Z)
- Disentangling Homophemes in Lip Reading using Perplexity Analysis [10.262299768603894]
This paper proposes a new application for the Generative Pre-Training transformer.
It serves as a language model to convert visual speech, in the form of visemes, into language in the form of words and sentences.
The network uses the search for optimal perplexity to perform the viseme-to-word mapping.
arXiv Detail & Related papers (2020-11-28T12:12:17Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Analysis and Evaluation of Language Models for Word Sense Disambiguation [18.001457030065712]
Transformer-based language models have taken many fields in NLP by storm.
BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense.
BERT and its derivatives dominate most of the existing evaluation benchmarks.
arXiv Detail & Related papers (2020-08-26T15:07:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.