Analyzing Political Text at Scale with Online Tensor LDA
- URL: http://arxiv.org/abs/2511.07809v1
- Date: Wed, 12 Nov 2025 01:20:12 GMT
- Title: Analyzing Political Text at Scale with Online Tensor LDA
- Authors: Sara Kangaslahti, Danny Ebanks, Jean Kossaifi, Anqi Liu, R. Michael Alvarez, Animashree Anandkumar
- Abstract summary: This paper proposes a topic modeling method that scales linearly to billions of documents. We show that this method is computationally and memory efficient, achieving speeds 3-4x those of prior parallelized Latent Dirichlet Allocation (LDA) methods. We perform two new real-world, large-scale studies of interest to political scientists.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a topic modeling method that scales linearly to billions of documents. We make three core contributions: i) we present a topic modeling method, Tensor Latent Dirichlet Allocation (TLDA), that has identifiable and recoverable parameter guarantees and sample complexity guarantees for large data; ii) we show that this method is computationally and memory efficient (achieving speeds 3-4x those of prior parallelized Latent Dirichlet Allocation (LDA) methods), and that it scales linearly to text datasets with over a billion documents; iii) we provide an open-source, GPU-based implementation of this method. This scaling enables previously prohibitive analyses, and we perform two new real-world, large-scale studies of interest to political scientists: we provide the first thorough analysis of the evolution of the #MeToo movement through the lens of over two years of Twitter conversation and a detailed study of social media conversations about election fraud in the 2020 presidential election. Thus this method provides social scientists with the ability to study very large corpora at scale and to answer important theoretically relevant questions about salient issues in near real-time.
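The abstract does not include implementation details, but the "scales linearly" claim of moment-based LDA methods such as TLDA rests on a general property: the method-of-moments statistics can be accumulated in a single streaming pass, so memory depends on vocabulary size, not corpus size. The sketch below is hypothetical and not the paper's code (the function and class names are illustrative); it shows the idea for the second-order word co-occurrence moment only. TLDA additionally requires a third-order moment and its CP decomposition, which are omitted here.

```python
import numpy as np

def doc_pair_moment(counts):
    """Empirical second-order co-occurrence moment for one document.

    counts: length-V vector of word counts (document length n >= 2).
    Returns the probability that an ordered pair of *distinct* token
    positions is (word i, word j); entries sum to 1.
    """
    n = counts.sum()
    # outer(c, c) counts ordered token pairs, including a token paired
    # with itself; subtracting diag(c) removes those self-pairs.
    pairs = np.outer(counts, counts) - np.diag(counts)
    return pairs / (n * (n - 1))

class OnlineMoments:
    """Streaming mean of per-document moments.

    Memory is O(V^2), independent of the number of documents, and each
    document is visited exactly once -- the property that lets
    moment-based LDA scale linearly to very large corpora.
    """
    def __init__(self, vocab_size):
        self.M2 = np.zeros((vocab_size, vocab_size))
        self.n_docs = 0

    def update(self, batch):
        """batch: (B, V) array of per-document word-count vectors."""
        for counts in batch:
            self.n_docs += 1
            # incremental mean: M2 <- M2 + (x - M2) / n
            self.M2 += (doc_pair_moment(counts) - self.M2) / self.n_docs
```

Because `update` only folds each mini-batch into a running mean, the estimator can consume an arbitrarily long document stream (e.g. years of tweets) at constant memory, which is what makes the corpus-scale analyses described above feasible.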
Related papers
- HICode: Hierarchical Inductive Coding with LLMs [3.0013352260516744]
We develop HICode, a two-part pipeline that first inductively generates labels directly from analysis data and then hierarchically clusters them to surface emergent themes. We validate this approach across three diverse datasets by measuring alignment with human-constructed themes and demonstrating its robustness through automated and human evaluations.
arXiv Detail & Related papers (2025-09-22T16:07:11Z) - Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study [55.09905978813599]
Large Language Models (LLMs) hold promise in automating data analysis tasks. Yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs.
arXiv Detail & Related papers (2025-06-24T17:04:23Z) - A Novel, Human-in-the-Loop Computational Grounded Theory Framework for Big Social Data [8.695136686770772]
We argue that confidence in the credibility and robustness of results depends on adopting a 'human-in-the-loop' methodology. We propose a novel methodological framework for Computational Grounded Theory (CGT) that supports the analysis of large qualitative datasets.
arXiv Detail & Related papers (2025-06-06T13:43:12Z) - Capturing research literature attitude towards Sustainable Development Goals: an LLM-based topic modeling approach [0.7806050661713976]
The Sustainable Development Goals were formulated by the United Nations in 2015 to address these global challenges by 2030.
Natural language processing techniques can help uncover discussions on SDGs within research literature.
We propose a completely automated pipeline to fetch content from the Scopus database and prepare datasets dedicated to five groups of SDGs.
arXiv Detail & Related papers (2024-11-05T09:37:23Z) - Enhancing literature review with LLM and NLP methods. Algorithmic trading case [0.0]
This study utilizes machine learning algorithms to analyze and organize knowledge in the field of algorithmic trading.
By filtering a dataset of 136 million research papers, we identified 14,342 relevant articles published between 1956 and Q1 2020.
arXiv Detail & Related papers (2024-10-23T13:37:27Z) - Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z) - From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models [56.9134620424985]
Cross-modal reasoning (CMR) is increasingly recognized as a crucial capability in the progression toward more sophisticated artificial intelligence systems.
The recent trend of deploying Large Language Models (LLMs) to tackle CMR tasks has marked a new mainstream of approaches for enhancing their effectiveness.
This survey offers a nuanced exposition of current methodologies applied in CMR using LLMs, classifying these into a detailed three-tiered taxonomy.
arXiv Detail & Related papers (2024-09-19T02:51:54Z) - QuaLLM: An LLM-based Framework to Extract Quantitative Insights from Online Forums [10.684484559041284]
This study introduces QuaLLM, a novel framework to analyze and extract quantitative insights from text data on online forums. We applied this framework to analyze over one million comments from two of Reddit's rideshare worker communities. We uncover significant worker concerns regarding AI and algorithmic platform decisions, responding to regulatory calls for worker insights.
arXiv Detail & Related papers (2024-05-08T18:20:03Z) - A Survey on Large-scale Machine Learning [67.6997613600942]
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions.
Most sophisticated machine learning approaches suffer from huge time costs when operating on large-scale data.
Large-scale Machine Learning aims to learn patterns from big data efficiently, with comparable performance.
arXiv Detail & Related papers (2020-08-10T06:07:52Z) - A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.