Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets
- URL: http://arxiv.org/abs/2510.18908v1
- Date: Tue, 21 Oct 2025 03:29:38 GMT
- Title: Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets
- Authors: Wangjiaxuan Xin, Shuhua Yin, Shi Chen, Yaorong Ge,
- Abstract summary: Shortness, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling.<n>We have developed emphTM-Rephrase, a model-agnostic framework that rephrases raw tweets into more standardized and formal language prior to topic modeling.<n>This study contributes to a model-agnostic approach to enhancing topic modeling in public health related social media analysis.
- Score: 2.073927793507761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media platforms such as Twitter (now X) provide rich data for analyzing public discourse, especially during crises such as the COVID-19 pandemic. However, the brevity, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling, producing incoherent or redundant topics that are often difficult to interpret. To address these challenges, we have developed \emph{TM-Rephrase}, a model-agnostic framework that leverages large language models (LLMs) to rephrase raw tweets into more standardized and formal language prior to topic modeling. Using a dataset of 25,027 COVID-19-related Twitter posts, we investigate the effects of two rephrasing strategies, general- and colloquial-to-formal-rephrasing, on multiple topic modeling methods. Results demonstrate that \emph{TM-Rephrase} improves three metrics measuring topic modeling performance (i.e., topic coherence, topic uniqueness, and topic diversity) while reducing topic redundancy of most topic modeling algorithms, with the colloquial-to-formal strategy yielding the greatest performance gains and especially for the Latent Dirichlet Allocation (LDA) algorithm. This study contributes to a model-agnostic approach to enhancing topic modeling in public health related social media analysis, with broad implications for improved understanding of public discourse in health crisis as well as other important domains.
Related papers
- Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs [25.915607750636333]
We propose a novel approach that leverages large language models (LLMs) to extend short texts into more detailed sequences before applying topic modeling.
Our method significantly improves short-text topic modeling performance, as demonstrated by extensive experiments on real-world datasets with extreme data sparsity.
arXiv Detail & Related papers (2024-10-04T01:28:56Z) - A Large Language Model Guided Topic Refinement Mechanism for Short Text Modeling [10.589126787499973]
Existing topic models often struggle to accurately capture the underlying semantic patterns of short texts.<n>This paper introduces a novel model-agnostic mechanism, termed Topic Refinement.<n>We show that Topic Refinement boosts the topic quality and improves the performance in topic-related text classification tasks.
arXiv Detail & Related papers (2024-03-26T13:50:34Z) - How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z) - TopicGPT: A Prompt-based Topic Modeling Framework [77.72072691307811]
We introduce TopicGPT, a prompt-based framework that uses large language models to uncover latent topics in a text collection.
It produces topics that align better with human categorizations compared to competing methods.
Its topics are also interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions.
arXiv Detail & Related papers (2023-11-02T17:57:10Z) - Let the Pretrained Language Models "Imagine" for Short Texts Topic
Modeling [29.87929724277381]
In short texts, co-occurrence information is minimal, which results in feature sparsity in document representation.
Existing topic models (probabilistic or neural) mostly fail to mine patterns from them to generate coherent topics.
We extend short text into longer sequences using existing pre-trained language models (PLMs)
arXiv Detail & Related papers (2023-10-24T00:23:30Z) - Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z) - Topic Discovery via Latent Space Clustering of Pretrained Language Model
Representations [35.74225306947918]
We propose a joint latent space learning and clustering framework built upon PLM embeddings.
Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery.
arXiv Detail & Related papers (2022-02-09T17:26:08Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as -- or better -- than traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z) - Topic modelling discourse dynamics in historical newspapers [2.978993130750125]
We apply two families of topic models (LDA and DTM) on a relatively large set of historical newspapers in Finland.
Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transposed to any diachronic data.
arXiv Detail & Related papers (2020-11-20T14:51:07Z) - Topic-Aware Multi-turn Dialogue Modeling [91.52820664879432]
This paper presents a novel solution for multi-turn dialogue modeling, which segments and extracts topic-aware utterances in an unsupervised way.
Our topic-aware modeling is implemented by a newly proposed unsupervised topic-aware segmentation algorithm and Topic-Aware Dual-attention Matching (TADAM) Network.
arXiv Detail & Related papers (2020-09-26T08:43:06Z) - How Far are We from Effective Context Modeling? An Exploratory Study on
Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parsing and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performances.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.