The study of short texts in digital politics: Document aggregation for topic modeling
- URL: http://arxiv.org/abs/2503.05065v1
- Date: Fri, 07 Mar 2025 01:05:46 GMT
- Title: The study of short texts in digital politics: Document aggregation for topic modeling
- Authors: Nitheesha Nakka, Omer F. Yalcin, Bruce A. Desmarais, Sarah Rajtmajer, Burt Monroe
- Abstract summary: We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. We analyze one million tweets by U.S. state legislators from April 2016 to September 2020. For documents aggregated at the account level, topics are more associated with individual states than when using individual tweets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Statistical topic modeling is widely used in political science to study text. Researchers examine documents of varying lengths, from tweets to speeches. There is ongoing debate on how document length affects the interpretability of topic models. We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. In our study, we analyze one million tweets by U.S. state legislators from April 2016 to September 2020. We find that for documents aggregated at the account level, topics are more associated with individual states than when using individual tweets. This finding is replicated with Wikipedia pages aggregated by birth cities, showing how document definitions can impact topic modeling results.
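To make the aggregation step concrete, here is a minimal sketch of pooling tweets into account-level documents before topic modeling. The paper does not specify its implementation, so the use of scikit-learn's LatentDirichletAllocation, the toy data, and all hyperparameters below are assumptions for illustration only.

```python
# Minimal sketch of the document-aggregation idea from the abstract:
# pool every tweet from one account into a single document, then fit a
# topic model. Model choice (scikit-learn's LDA), the toy data, and all
# hyperparameters are illustrative assumptions, not the paper's setup.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the tweet corpus: one row per tweet.
tweets = pd.DataFrame({
    "account": ["@legislatorA", "@legislatorA", "@legislatorB", "@legislatorB"],
    "text": [
        "Voted today on the state education budget.",
        "Town hall this Friday on school funding.",
        "Our highway repair bill passed committee.",
        "New transportation funding for rural roads.",
    ],
})

# Aggregation step: one document per account (the "natural unit"),
# instead of one document per tweet.
docs = tweets.groupby("account")["text"].apply(" ".join)

# Bag-of-words representation of the aggregated documents.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA on the account-level documents; n_components is illustrative.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-account topic proportions

# Inspect the top words in each estimated topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {k}: {', '.join(top)}")
```

Changing the groupby key (for example, aggregating by state rather than by account) changes the unit that defines a "document", which is the design choice whose downstream effects the paper examines.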
Related papers
- From Small to Large Language Models: Revisiting the Federalist Papers [0.0]
We review some of the more popular Large Language Model (LLM) tools and examine them from a statistical point of view in the context of text classification. We investigate whether, without any attempt to fine-tune, the general embedding constructs can be useful for stylometry and attribution.
arXiv Detail & Related papers (2025-02-25T21:50:46Z)
- Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia [49.80565462746646]
We introduce the InfoGap method -- an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level.
We evaluate InfoGap by analyzing LGBT people's portrayals, across 2.7K biography pages on English, Russian, and French Wikipedias.
arXiv Detail & Related papers (2024-10-05T20:40:49Z)
- CausalCite: A Causal Formulation of Paper Citations [80.82622421055734]
CausalCite is a new way to measure the significance of a paper by assessing the causal impact of the paper on its follow-up papers.
It is based on a novel causal inference method, TextMatch, which adapts the traditional matching framework to high-dimensional text embeddings.
We demonstrate the effectiveness of CausalCite on various criteria, such as high correlation with paper impact as reported by scientific experts.
arXiv Detail & Related papers (2023-11-05T23:09:39Z)
- Neural Natural Language Processing for Long Texts: A Survey on Classification and Summarization [6.728794938150435]
The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP).
The ever-increasing size of documents uploaded online renders automated understanding of lengthy texts a critical issue.
This article serves as an entry point into this dynamic domain and aims to achieve two objectives.
arXiv Detail & Related papers (2023-05-25T17:13:44Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human-level results on the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- Topic Modelling of Swedish Newspaper Articles about Coronavirus: a Case Study using Latent Dirichlet Allocation Method [8.405827390095064]
Topic Modelling (TM) is a research branch of natural language understanding (NLU) and natural language processing (NLP).
In this study, we apply the popular Latent Dirichlet Allocation (LDA) method to model topic changes in Swedish newspaper articles about Coronavirus.
We describe the corpus we created, comprising 6515 articles, the methods applied, and statistics on topic changes over a period of approximately fourteen months, from 17 January 2020 to 13 March 2021.
arXiv Detail & Related papers (2023-01-08T12:33:58Z)
- Predicting Long-Term Citations from Short-Term Linguistic Influence [20.78217545537925]
A standard measure of the influence of a research paper is the number of times it is cited.
We propose a novel method to quantify linguistic influence in timestamped document collections.
arXiv Detail & Related papers (2022-10-24T22:03:26Z)
- Twitter Topic Classification [15.306383757213956]
We present a new task based on tweet topic classification and release two associated datasets.
Given a wide range of topics covering the most important discussion points in social media, we provide training and testing data.
We perform a quantitative evaluation and analysis of current general- and domain-specific language models on the task.
arXiv Detail & Related papers (2022-09-20T16:13:52Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short texts.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related linguistic bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
- From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information [77.89755281215079]
Text summarization is the research area that aims to create a short, condensed version of an original document.
In real-world applications, most of the data is not in a plain text format.
This paper surveys these new summarization tasks and approaches for real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.