Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams
- URL: http://arxiv.org/abs/2510.18786v1
- Date: Tue, 21 Oct 2025 16:40:14 GMT
- Title: Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams
- Authors: Federica Granese, Serena Villata, Charles Bouveyron,
- Abstract summary: SB-SETM is an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches.<n> Numerical experiments show SB-SETM outperforming baselines on simulated scenarios.<n>We extensively test it on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.
- Score: 9.41487883751588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online topic models are unsupervised algorithms to identify latent topics in data streams that continuously evolve over time. Although these methods naturally align with real-world scenarios, they have received considerably less attention from the community compared to their offline counterparts, due to specific additional challenges. To tackle these issues, we present SB-SETM, an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the topic-per-document distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show SB-SETM outperforming baselines on simulated scenarios. We extensively test it on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.
Related papers
- Edit-Based Flow Matching for Temporal Point Processes [51.33476564706644]
temporal point processes (TPPs) are a fundamental tool for modeling event sequences in continuous time.<n>Recent non-autoregressive, diffusion-style models mitigate these issues by jointly interpolating between noise and data.<n>We introduce an Edit Flow process for TPPs that transports noise to data via insert, delete, and substitute edit operations.
arXiv Detail & Related papers (2025-10-07T15:44:12Z) - ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction [84.90394416593624]
Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions.<n>Existing simulation-based data generation methods rely heavily on costly autoregressive interactions between multiple agents.<n>We propose a novel Non-Autoregressive Iterative Generation framework, called ToolACE-MT, for constructing high-quality multi-turn agentic dialogues.
arXiv Detail & Related papers (2025-08-18T07:38:23Z) - Merging Embedded Topics with Optimal Transport for Online Topic Modeling on Data Streams [8.618304780146348]
StreamETM builds on the Embedded Topic Model (ETM) to handle data streams.<n>An online change point detection algorithm is employed to identify shifts in topics over time.<n> Numerical experiments on simulated and real-world data show StreamETM outperforming competitors.
arXiv Detail & Related papers (2025-04-10T13:04:56Z) - MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z) - Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs [25.915607750636333]
We propose a novel approach that leverages large language models (LLMs) to extend short texts into more detailed sequences before applying topic modeling.
Our method significantly improves short-text topic modeling performance, as demonstrated by extensive experiments on real-world datasets with extreme data sparsity.
arXiv Detail & Related papers (2024-10-04T01:28:56Z) - FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model [76.509837704596]
We propose FASTopic, a fast, adaptive, stable, and transferable topic model.
We use Dual Semantic-relation Reconstruction (DSR) to model latent topics.
We also propose Embedding Transport Plan (ETP) to regularize semantic relations as optimal transport plans.
arXiv Detail & Related papers (2024-05-28T09:06:38Z) - Controllable Topic-Focused Abstractive Summarization [57.8015120583044]
Controlled abstractive summarization focuses on producing condensed versions of a source article to cover specific aspects.
This paper presents a new Transformer-based architecture capable of producing topic-focused summaries.
arXiv Detail & Related papers (2023-11-12T03:51:38Z) - Recurrent Coupled Topic Modeling over Sequential Documents [33.35324412209806]
We show that a current topic evolves from all prior topics with corresponding coupling weights, forming the multi-topic-thread evolution.
A new solution with a set of novel data augmentation techniques is proposed, which successfully discomposes the multi-couplings between evolving topics.
A novel Gibbs sampler with a backward-forward filter algorithm efficiently learns latent timeevolving parameters in a closed-form.
arXiv Detail & Related papers (2021-06-23T08:58:13Z) - Neural Topic Modeling with Cycle-Consistent Adversarial Training [17.47328718035538]
We propose Topic Modeling with Cycle-consistent Adversarial Training (ToMCAT) and its supervised version sToMCAT.
ToMCAT employs a generator network to interpret topics and an encoder network to infer document topics.
SToMCAT extends ToMCAT by incorporating document labels into the topic modeling process to help discover more coherent topics.
arXiv Detail & Related papers (2020-09-29T12:41:27Z) - Modeling Topical Relevance for Multi-Turn Dialogue Generation [61.87165077442267]
We propose a new model, named STAR-BTM, to tackle the problem of topic drift in multi-turn dialogue.
The Biterm Topic Model is pre-trained on the whole training dataset. Then, the topic level attention weights are computed based on the topic representation of each context.
Experimental results on both Chinese customer services data and English Ubuntu dialogue data show that STAR-BTM significantly outperforms several state-of-the-art methods.
arXiv Detail & Related papers (2020-09-27T03:33:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.