Simplifying Multilingual News Clustering Through Projection From a Shared Space
- URL: http://arxiv.org/abs/2204.13418v1
- Date: Thu, 28 Apr 2022 11:32:49 GMT
- Title: Simplifying Multilingual News Clustering Through Projection From a Shared Space
- Authors: João Santos, Afonso Mendes and Sebastião Miranda
- Abstract summary: The task of organizing and clustering multilingual news articles for media monitoring is essential to follow news stories in real time.
Most approaches to this task focus on high-resource languages (mostly English), with low-resource languages being disregarded.
We present a much simpler online system that is able to cluster an incoming stream of documents without depending on language-specific features.
- Score: 0.39560040546164016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of organizing and clustering multilingual news articles for media
monitoring is essential to follow news stories in real time. Most approaches to
this task focus on high-resource languages (mostly English), with low-resource
languages being disregarded. With that in mind, we present a much simpler
online system that is able to cluster an incoming stream of documents without
depending on language-specific features. We empirically demonstrate that the
use of multilingual contextual embeddings as the document representation
significantly improves clustering quality. We challenge previous cross-lingual
approaches by removing the precondition of building monolingual clusters. We
model the clustering process as a set of linear classifiers to aggregate
similar documents, and correct closely-related multilingual clusters through
merging in an online fashion. Our system achieves state-of-the-art results on a
multilingual news stream clustering dataset, and we introduce a new evaluation
for zero-shot news clustering in multiple languages. We make our code available
as open-source.
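
The clustering procedure described in the abstract lends itself to a compact sketch. The snippet below is an illustrative reconstruction, not the authors' released code: the encoder name and both thresholds are assumptions, and cosine similarity against running centroids stands in for the paper's learned linear classifiers.

```python
# Minimal online multilingual news clustering sketch (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

# Any multilingual sentence encoder fits here; this model name is an
# illustrative choice, not necessarily the one used in the paper.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

ASSIGN_THRESHOLD = 0.5  # illustrative values, not the paper's tuned ones
MERGE_THRESHOLD = 0.8

clusters = []  # each cluster: {"sum": unnormalized embedding sum, "size": int}

def centroid(cluster):
    return cluster["sum"] / np.linalg.norm(cluster["sum"])

def assign(doc_text):
    """Route one incoming document to the most similar cluster, or open a
    new one. Cosine similarity stands in for the paper's linear classifiers."""
    v = encoder.encode(doc_text, normalize_embeddings=True)
    if clusters:
        sims = [float(v @ centroid(c)) for c in clusters]
        best = int(np.argmax(sims))
        if sims[best] >= ASSIGN_THRESHOLD:
            clusters[best]["sum"] += v
            clusters[best]["size"] += 1
            return best
    clusters.append({"sum": v.astype(np.float64), "size": 1})
    return len(clusters) - 1

def merge_close_clusters():
    """Online correction: merge clusters whose centroids nearly coincide,
    e.g. the same news story arriving in different languages."""
    i = 0
    while i < len(clusters):
        j = i + 1
        while j < len(clusters):
            if float(centroid(clusters[i]) @ centroid(clusters[j])) >= MERGE_THRESHOLD:
                clusters[i]["sum"] += clusters[j]["sum"]
                clusters[i]["size"] += clusters[j]["size"]
                clusters.pop(j)
            else:
                j += 1
        i += 1
```

In use, `assign` would be called on every arriving article and `merge_close_clusters` run periodically; the merging step mirrors the paper's online correction of closely-related multilingual clusters.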
Related papers
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
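
As a rough illustration of that separation, the translate-and-test pipeline reduces to two stages; `translate` and `classify_english` below are hypothetical stand-ins for any machine translation system and any English-language classifier, not components of T3L itself.

```python
# "Translate-and-test" in two stages (illustrative sketch).
def translate_and_test(text, src_lang, translate, classify_english):
    english = translate(text, src_lang=src_lang, tgt_lang="en")  # stage 1: MT
    return classify_english(english)                             # stage 2: classify
```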
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
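
The robustness claim comes down to the decision rule: independent sigmoids can all stay low on an unseen language, while softmax is forced to pick some class. A minimal illustration (not the authors' model), assuming one logit per target language:

```python
import torch

def multiclass_decision(logits):
    # Softmax must commit to one of the known languages.
    return int(torch.softmax(logits, dim=-1).argmax())

def multilabel_decision(logits, threshold=0.5):
    # One independent sigmoid score per language allows rejection.
    probs = torch.sigmoid(logits)
    if float(probs.max()) < threshold:
        return None  # likely an unseen non-target language
    return int(probs.argmax())
```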
arXiv Detail & Related papers (2023-06-02T23:04:19Z)
- Research on Multilingual News Clustering Based on Cross-Language Word Embeddings [7.401514098389491]
We train a cross-lingual model through knowledge distillation that can represent sentence-level bilingual texts in both Chinese and English.
We adapt the Single-Pass clustering algorithm for the news context to make it more applicable.
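
The distillation setup can be sketched along the lines of standard multilingual sentence-embedding distillation; the details below (a frozen teacher embedding of the English side, MSE on both sides of the parallel pair) are assumptions in that spirit, not the authors' exact recipe.

```python
import torch.nn.functional as F

def distillation_loss(teacher_en, student_en, student_zh):
    # teacher_en: frozen teacher embedding of the English sentence
    # student_en, student_zh: student embeddings of the parallel EN/ZH pair
    return F.mse_loss(student_en, teacher_en) + F.mse_loss(student_zh, teacher_en)
```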
arXiv Detail & Related papers (2023-05-30T09:24:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
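
Of these tasks, bitext mining has a particularly simple skeleton: embed both sides and pair each source sentence with its nearest target neighbour by cosine similarity. The sketch below assumes precomputed embedding matrices from any multilingual encoder; it is not the paper's variational model.

```python
import numpy as np

def mine_bitext(src_emb, tgt_emb):
    # Row-normalize so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T
    return sims.argmax(axis=1)  # best target index for each source sentence
```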
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Graph Neural Network Enhanced Language Models for Efficient Multilingual Text Classification [8.147244878591014]
We propose a multilingual disaster-related text classification system capable of operating in monolingual, cross-lingual, and multilingual scenarios.
Our end-to-end trainable framework leverages the versatility of graph neural networks, applied over the corpus.
We evaluate our framework on a total of nine English and non-English datasets in monolingual, cross-lingual, and multilingual classification scenarios.
arXiv Detail & Related papers (2022-03-06T09:05:42Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network [2.6936806968297913]
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
arXiv Detail & Related papers (2021-05-24T12:45:42Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
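
A common baseline for handling unseen scripts, which the paper's data-efficient methods improve on, is to extend the tokenizer vocabulary and grow the embedding matrix before continued pretraining. A sketch with the Hugging Face Transformers API (the added characters are illustrative):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

new_tokens = ["ⵣ", "ⵉ", "ⵎ"]  # e.g. Tifinagh characters (illustrative)
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized
# ...then continue masked-language-model training on target-language text.
```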
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies [0.0]
This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora.
The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels.
Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.
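
One simple way to compare documents through such labels is set overlap; the Jaccard sketch below is an illustrative stand-in for the paper's concept-hierarchy matching, not its actual algorithm.

```python
def doc_similarity(labels_a, labels_b):
    # Jaccard overlap of the cross-lingual concept labels of two documents.
    if not labels_a or not labels_b:
        return 0.0
    return len(set(labels_a) & set(labels_b)) / len(set(labels_a) | set(labels_b))
```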
arXiv Detail & Related papers (2020-12-15T10:42:40Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
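
A self-teaching loss of this kind is typically a KL divergence pulling the model's distribution on target-language input toward soft pseudo-labels computed for the translated view; the sketch below follows that general form and is not FILTER's exact formulation.

```python
import torch.nn.functional as F

def self_teaching_loss(target_logits, pseudo_logits, temperature=1.0):
    # Soft pseudo-labels from the translated text act as the teacher.
    teacher = F.softmax(pseudo_logits.detach() / temperature, dim=-1)
    student = F.log_softmax(target_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")
```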
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We also create a news topic classification task.
We investigate an approach on data augmentation, better suited to low resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)