Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings
- URL: http://arxiv.org/abs/2506.00277v1
- Date: Fri, 30 May 2025 22:17:18 GMT
- Title: Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings
- Authors: Hans W. A. Hanley, Zakir Durumeric
- Abstract summary: We present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. We first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity. We develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes.
- Score: 5.161088104035108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual large language model embeddings are increasingly utilized for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity based on which subset of the dimensions of the embeddings is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson $\rho$ = 0.816). Once trained, we develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We conclude by illustrating how our approach can identify and cluster stories, narratives, and overarching themes within real-world news datasets.
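To make the core mechanism concrete, below is a minimal sketch of how Matryoshka embeddings support level-wise clustering. Everything here is an illustrative assumption rather than the authors' released code: the prefix lengths, the threshold, and the greedy single-pass assignment are placeholders. The key idea is that similarity at a coarser level is simply cosine similarity computed over a prefix of the embedding dimensions.

```python
import numpy as np

# Hypothetical granularity levels (prefix lengths are illustrative, not the
# paper's configuration): short prefixes capture broad themes, longer
# prefixes resolve individual stories.
LEVELS = {"theme": 64, "narrative": 256, "story": 768}

def matryoshka_similarity(a: np.ndarray, b: np.ndarray, dims: int) -> float:
    """Cosine similarity computed over only the first `dims` dimensions."""
    a, b = a[:dims], b[:dims]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_at_level(embeddings: np.ndarray, dims: int, threshold: float) -> list[int]:
    """Greedy single-pass clustering at one granularity level: assign each
    article to the best existing cluster if its prefix similarity to that
    cluster's centroid clears the threshold, otherwise open a new cluster."""
    centroids: list[np.ndarray] = []
    labels: list[int] = []
    for emb in embeddings:
        sims = [matryoshka_similarity(emb, c, dims) for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            centroids[k] = (centroids[k] + emb) / 2.0  # crude running update
        else:
            k = len(centroids)
            centroids.append(emb.astype(float))
        labels.append(k)
    return labels

# Toy usage: cluster random "article" embeddings at the coarsest level.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    articles = rng.normal(size=(100, 768))
    theme_labels = cluster_at_level(articles, LEVELS["theme"], threshold=0.3)
    print(f"{max(theme_labels) + 1} theme-level clusters")
```

To obtain the full hierarchy, one would cluster the corpus at the coarsest prefix first and then re-run the same procedure within each resulting cluster at progressively longer prefixes, so that themes subdivide into narratives and narratives into stories.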
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method. We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words. Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- Research on Multilingual News Clustering Based on Cross-Language Word Embeddings [7.401514098389491]
Through knowledge distillation, we train a cross-lingual model that can represent sentence-level bilingual texts in both Chinese and English.
We adapt the Single-Pass clustering algorithm to the news context to make it more applicable.
arXiv Detail & Related papers (2023-05-30T09:24:55Z)
- Topics in Contextualised Attention Embeddings [7.6650522284905565]
Recent work has demonstrated that clustering word-level contextual representations from a language model yields word clusters resembling the latent topics discovered by Latent Dirichlet Allocation.
An important question is how such topical word clusters form automatically in a language model that has not been explicitly designed to model latent topics.
Using BERT and DistilBERT, we find that the attention framework plays a key role in modelling such word topic clusters.
arXiv Detail & Related papers (2023-01-11T07:26:19Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding [53.03978356918377]
Spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks.
Existing methods learn features from either word-level or region-level but fail to consider both simultaneously.
We propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time.
arXiv Detail & Related papers (2022-11-27T22:47:37Z)
- TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to zero-shot and few-shot classification of tabular data.
We evaluate several serialization methods, including templates, table-to-text models, and large language models.
The resulting approach is also competitive with strong traditional baselines such as gradient-boosted trees.
arXiv Detail & Related papers (2022-10-19T17:08:13Z)
- Simplifying Multilingual News Clustering Through Projection From a Shared Space [0.39560040546164016]
The task of organizing and clustering multilingual news articles for media monitoring is essential to follow news stories in real time.
Most approaches to this task focus on high-resource languages (mostly English), with low-resource languages being disregarded.
We present a much simpler online system that is able to cluster an incoming stream of documents without depending on language-specific features.
arXiv Detail & Related papers (2022-04-28T11:32:49Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have attracted significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Topic Modeling with Contextualized Word Representation Clusters [8.49454123392354]
Clustering token-level contextualized word representations produces output that shares many similarities with topic models for English text collections.
We evaluate token clusterings trained from several different output layers of popular contextualized language models.
arXiv Detail & Related papers (2020-10-23T19:16:59Z)