Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling
- URL: http://arxiv.org/abs/2512.11635v1
- Date: Fri, 12 Dec 2025 15:15:02 GMT
- Title: Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling
- Authors: Keerthana Murugaraj, Salima Lamsiyah, Marten During, Martin Theobald
- Abstract summary: Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time.
- Score: 1.4322802933929257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics, which, despite its growing popularity, still remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
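The abstract describes tracing how topic distributions evolve over time. As a rough illustration only, and not the authors' pipeline, the sketch below assumes each article has already been given a single topic label (for example by a topic model such as BERTopic) and computes each topic's share of articles per decade. The function name `topic_shares_by_period`, the ten-year binning, and the toy data are all invented for this example.

```python
from collections import Counter, defaultdict


def topic_shares_by_period(docs, period_years=10):
    """Compute each topic's share of articles within fixed-length periods.

    docs: iterable of (year, topic_label) pairs, one pair per article.
    Returns {period_start: {topic_label: share}}, ordered by period_start;
    the shares inside each period sum to 1.
    """
    counts = defaultdict(Counter)
    for year, topic in docs:
        # Bin the year to the start of its period, e.g. 1957 -> 1950.
        period_start = year - (year % period_years)
        counts[period_start][topic] += 1

    shares = {}
    for period_start in sorted(counts):
        total = sum(counts[period_start].values())
        shares[period_start] = {
            topic: n / total for topic, n in counts[period_start].items()
        }
    return shares


# Hypothetical topic assignments; these are made up, not data from the paper.
toy_docs = [
    (1957, "nuclear weapons"),
    (1958, "nuclear weapons"),
    (1959, "nuclear power"),
    (1986, "nuclear safety"),
    (1986, "nuclear safety"),
    (1987, "nuclear power"),
    (1988, "nuclear safety"),
    (2011, "nuclear safety"),
]

shares = topic_shares_by_period(toy_docs)
for period, distribution in shares.items():
    print(period, distribution)
```

Comparing the per-period dictionaries shows which topics gain or lose relative importance. A real analysis would likely use finer time bins and the topic model's own document-topic probabilities rather than one hard label per article.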
Related papers
- DiscoSum: Discourse-aware News Summarization [79.4884227574627]
We introduce a novel approach to integrating discourse structure into summarization processes. We present a novel summarization dataset where news articles are summarized multiple times in different ways across different social media platforms. We develop a novel news discourse schema to describe summarization structures and a novel algorithm, DiscoSum, which employs beam search technique for structure-aware summarization.
arXiv Detail & Related papers (2025-06-07T22:00:30Z) - Talking Point based Ideological Discourse Analysis in News Events [62.18747509565779]
We propose a framework motivated by the theory of ideological discourse analysis to analyze news articles related to real-world events. Our framework represents the news articles using a relational structure - talking points, which captures the interaction between entities, their roles, and media frames along with a topic of discussion. We evaluate our framework's ability to generate these perspectives through automated tasks - ideology and partisan classification tasks, supplemented by human validation.
arXiv Detail & Related papers (2025-04-10T02:52:34Z) - A Large Language Model Guided Topic Refinement Mechanism for Short Text Modeling [10.589126787499973]
Existing topic models often struggle to accurately capture the underlying semantic patterns of short texts. This paper introduces a novel model-agnostic mechanism, termed Topic Refinement. We show that Topic Refinement boosts the topic quality and improves the performance in topic-related text classification tasks.
arXiv Detail & Related papers (2024-03-26T13:50:34Z) - Discovering Latent Themes in Social Media Messaging: A Machine-in-the-Loop Approach Integrating LLMs [22.976609127865732]
We introduce a novel approach to uncovering latent themes in social media messaging.
Our work sheds light on the dynamic nature of social media, revealing the shifts in the thematic focus of messaging in response to real-world events.
arXiv Detail & Related papers (2024-03-15T21:54:00Z) - Recent Advances in Hate Speech Moderation: Multimodality and the Role of Large Models [52.24001776263608]
This comprehensive survey delves into the recent strides in HS moderation.
We highlight the burgeoning role of large language models (LLMs) and large multimodal models (LMMs).
We identify existing gaps in research, particularly in the context of underrepresented languages and cultures.
arXiv Detail & Related papers (2024-01-30T03:51:44Z) - ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics [1.854328133293073]
This paper presents an algorithmic family of dynamic topic models called Aligned Neural Topic Models (ANTM).
ANTM combines novel data mining algorithms to provide a modular framework for discovering evolving topics.
A Python package is developed for researchers and scientists who wish to study the trends and evolving patterns of topics in large-scale textual data.
arXiv Detail & Related papers (2023-02-03T02:31:12Z) - Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z) - An NLP approach to quantify dynamic salience of predefined topics in a
text corpus [0.0]
We use natural language processing techniques to quantify how a set of pre-defined topics of interest change over time across a large corpus of text.
We find that given a predefined topic, we can identify and rank sets of terms, or n-grams, that map to those topics and have usage patterns that deviate from a normal baseline.
arXiv Detail & Related papers (2021-08-16T21:00:06Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z) - Topic modelling discourse dynamics in historical newspapers [2.978993130750125]
We apply two families of topic models (LDA and DTM) on a relatively large set of historical newspapers in Finland.
Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transposed to any diachronic data.
arXiv Detail & Related papers (2020-11-20T14:51:07Z) - Ranking Enhanced Dialogue Generation [77.8321855074999]
How to effectively utilize the dialogue history is a crucial problem in multi-turn dialogue generation.
Previous works usually employ various neural network architectures to model the history.
This paper proposes a Ranking Enhanced Dialogue generation framework.
arXiv Detail & Related papers (2020-08-13T01:49:56Z) - Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers [2.5899040911480187]
We introduce a multimodal approach for the semantic segmentation of historical newspapers.
Based on experiments on diachronic Swiss and Luxembourgish newspapers, we investigate the predictive power of visual and textual features.
Results show consistent improvement of multimodal models in comparison to a strong visual baseline.
arXiv Detail & Related papers (2020-02-14T17:56:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.