Author Clustering and Topic Estimation for Short Texts
- URL: http://arxiv.org/abs/2106.09533v1
- Date: Tue, 15 Jun 2021 20:55:55 GMT
- Title: Author Clustering and Topic Estimation for Short Texts
- Authors: Graham Tierney and Christopher Bail and Alexander Volfovsky
- Abstract summary: We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
- Score: 69.54017251622211
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Analysis of short text, such as social media posts, is extremely difficult
because it relies on observing many document-level word co-occurrence pairs.
Beyond topic distributions, a common downstream task of the modeling is
grouping the authors of these documents for subsequent analyses. Traditional
models estimate the document groupings and identify user clusters with an
independent procedure. We propose a novel model that expands on Latent
Dirichlet Allocation by modeling strong dependence among the words in the same
document, with user-level topic distributions. We also simultaneously cluster
users, removing the need for post-hoc cluster estimation and improving topic
estimation by shrinking noisy user-level topic distributions towards typical
values. Our method performs as well as, or better than, traditional
approaches to problems arising in short text, and we demonstrate its usefulness
on a dataset of tweets from United States Senators, recovering both meaningful
topics and clusters that reflect partisan ideology.
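The abstract only sketches the model, but its two key ideas (strong word dependence within a short document, and user-level topic distributions shrunk toward cluster-level typical values) can be illustrated with a small generative sketch. Everything below, from the single-topic-per-document simplification to the variable names and hyperparameters, is an assumption for illustration, not the authors' exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, U, V = 5, 2, 20, 50  # topics, user clusters, users, vocabulary size

phi = rng.dirichlet(np.ones(V) * 0.1, size=K)       # topic-word distributions
cluster_topics = rng.dirichlet(np.ones(K), size=C)  # "typical values" per cluster

cluster_of = rng.integers(C, size=U)  # each user belongs to one cluster
concentration = 50.0                  # larger => stronger shrinkage to the cluster
user_topics = np.array([
    rng.dirichlet(concentration * cluster_topics[c]) for c in cluster_of
])

def sample_short_doc(user, length=12):
    """Draw the whole document from a single topic: a crude way to encode
    strong dependence among the words of one short text."""
    z = rng.choice(K, p=user_topics[user])
    return rng.choice(V, size=length, p=phi[z])

doc = sample_short_doc(user=3)
```

Because user-level distributions are drawn around their cluster's typical values, a user with few tweets still gets a stable topic estimate, which is the shrinkage effect the abstract describes.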
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
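The summary does not define self-similarity; one common reading, assumed here, is the mean pairwise cosine similarity of a word's contextual embeddings across its occurrences, with functional words scoring low. A minimal sketch of filtering candidate topic words with such a score (the threshold and data layout are assumptions):

```python
import numpy as np

def self_similarity(context_vectors):
    """Mean pairwise cosine similarity of one word's contextual embeddings."""
    X = np.asarray(context_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    n = len(X)
    return (sims.sum() - n) / (n * (n - 1))  # drop the n diagonal self-pairs

def candidate_topic_words(word_to_contexts, threshold=0.4):
    """Keep only words whose representations are stable across contexts."""
    return [w for w, vecs in word_to_contexts.items()
            if len(vecs) > 1 and self_similarity(vecs) >= threshold]
```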
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- Interactive Topic Models with Optimal Transport [75.26555710661908]
We present EdTM, an approach for label-name supervised topic modeling.
EdTM models topic modeling as an assignment problem while leveraging LM/LLM based document-topic affinities.
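The snippet does not say which solver EdTM uses; as a stand-in, a generic entropic optimal-transport (Sinkhorn) sketch over an assumed LM-derived affinity matrix shows how document-topic matching can be posed as an assignment problem:

```python
import numpy as np

def sinkhorn(affinity, doc_mass, topic_mass, eps=0.1, n_iter=200):
    """Entropically regularized transport between documents and topics.
    affinity: (D, T) scores, e.g. LM/LLM-based document-topic similarities.
    Returns a (D, T) plan whose rows give soft topic assignments."""
    K = np.exp(affinity / eps)  # Gibbs kernel
    u = np.ones_like(doc_mass)
    for _ in range(n_iter):
        v = topic_mass / (K.T @ u)
        u = doc_mass / (K @ v)
    return u[:, None] * K * v[None, :]

D, T = 100, 8
aff = np.random.default_rng(1).random((D, T))
plan = sinkhorn(aff, np.full(D, 1 / D), np.full(T, 1 / T))
hard_assignment = plan.argmax(axis=1)  # most-likely topic per document
```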
arXiv Detail & Related papers (2024-06-28T13:57:27Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
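The word-intrusion task itself is simple to sketch: show a topic's top words plus one planted intruder, and check whether an automatic judge picks the same odd word out that humans do. The `embed` word-vector lookup is assumed; this illustrates the evaluation task, not the paper's model:

```python
import numpy as np

def pick_intruder(words, embed):
    """Return the word least similar, on average, to the others;
    agreement with the planted intruder signals a coherent topic."""
    V = np.stack([embed[w] for w in words]).astype(float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = V @ V.T
    np.fill_diagonal(sims, 0.0)
    return words[int((sims.sum(axis=1) / (len(words) - 1)).argmin())]
```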
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- Improved Topic modeling in Twitter through Community Pooling [0.0]
Twitter posts are short and often less coherent than other text documents.
We propose a new pooling scheme for topic modeling in Twitter, which groups tweets whose authors belong to the same community.
Results show that our community pooling method outperformed other methods on the majority of metrics in two heterogeneous datasets.
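A minimal sketch of community pooling, assuming author-to-community assignments arrive from upstream community detection (e.g., on the follower graph); all names and parameters here are illustrative:

```python
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def community_pooled_lda(tweets, author_of, community_of, n_topics=10):
    """Concatenate tweets from each author community into one pseudo-document,
    then fit standard LDA on the pooled, longer documents."""
    pooled = defaultdict(list)
    for i, text in enumerate(tweets):
        pooled[community_of[author_of[i]]].append(text)
    docs = [" ".join(texts) for texts in pooled.values()]
    X = CountVectorizer(stop_words="english").fit_transform(docs)
    return LatentDirichletAllocation(n_components=n_topics,
                                     random_state=0).fit(X)
```

Pooling lengthens the effective documents, which restores the word co-occurrence signal that standard LDA needs on short texts.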
arXiv Detail & Related papers (2021-12-20T17:05:32Z)
- A Proposition-Level Clustering Approach for Multi-Document Summarization [82.4616498914049]
We revisit the clustering approach, grouping together propositions for more precise information alignment.
Our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster by fusing its propositions.
Our summarization method improves over the previous state-of-the-art MDS method in the DUC 2004 and TAC 2011 datasets.
arXiv Detail & Related papers (2021-12-16T10:34:22Z)
- Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders [59.038157066874255]
We propose a novel framework called RankAE to perform chat summarization without employing manually labeled data.
RankAE consists of a topic-oriented ranking strategy that selects topic utterances according to centrality and diversity simultaneously.
A denoising auto-encoder is designed to generate succinct but context-informative summaries based on the selected utterances.
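RankAE's exact ranking strategy is not spelled out in the snippet; an MMR-style greedy stand-in over an assumed utterance-similarity matrix conveys how centrality and diversity can be balanced simultaneously:

```python
import numpy as np

def select_topic_utterances(sim, k=5, lam=0.7):
    """Greedy selection: high average similarity to the chat (centrality)
    minus high similarity to already-chosen turns (redundancy)."""
    centrality = sim.mean(axis=1)
    chosen = [int(centrality.argmax())]
    while len(chosen) < k:
        redundancy = sim[:, chosen].max(axis=1)
        score = lam * centrality - (1 - lam) * redundancy
        score[chosen] = -np.inf  # never pick the same utterance twice
        chosen.append(int(score.argmax()))
    return chosen
```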
arXiv Detail & Related papers (2020-12-14T07:31:17Z)
- A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces [4.18804572788063]
Authorial clustering involves grouping documents written by the same author or team of authors without any prior positive examples of an author's writing style or thematic preferences.
We propose a high-level framework which utilizes a compact data representation in a latent feature space derived with non-parametric topic modeling.
We report on experiments with 120 collections in three languages and two genres and show that the topic-based latent feature space provides a promising level of performance.
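A rough sketch of the framework, with two loudly labeled simplifications: the paper's non-parametric topic model is replaced by plain LDA with a fixed number of topics, and the number of authorial clusters is assumed known:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import AgglomerativeClustering

def authorial_clusters(docs, n_topics=20, n_clusters=5):
    """Embed documents in a compact latent topic space, then cluster there
    instead of in the raw high-dimensional word space."""
    X = CountVectorizer(min_df=2).fit_transform(docs)
    theta = LatentDirichletAllocation(n_components=n_topics,
                                      random_state=0).fit_transform(X)
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(theta)
```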
arXiv Detail & Related papers (2020-11-30T17:39:44Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
- Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! [5.819224524813161]
We propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words.
The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.
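A minimal sketch of the approach: cluster pretrained word vectors into topics, then rerank each cluster's words with document information. Plain document frequency stands in here for the paper's weighting and reranking scheme:

```python
import numpy as np
from sklearn.cluster import KMeans

def embedding_topics(vocab, vectors, doc_freq, n_topics=10, top_n=10):
    """vocab: list of words; vectors: (len(vocab), dim) pretrained embeddings;
    doc_freq: word -> number of documents containing it."""
    labels = KMeans(n_clusters=n_topics, n_init=10,
                    random_state=0).fit_predict(vectors)
    topics = []
    for t in range(n_topics):
        idx = np.flatnonzero(labels == t)
        ranked = idx[np.argsort([-doc_freq[vocab[i]] for i in idx])]
        topics.append([vocab[i] for i in ranked[:top_n]])
    return topics
```

No sampling or variational inference is involved, which is where the runtime advantage over classical topic models comes from.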
arXiv Detail & Related papers (2020-04-30T16:18:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.