Batch Clustering for Multilingual News Streaming
- URL: http://arxiv.org/abs/2004.08123v1
- Date: Fri, 17 Apr 2020 08:59:13 GMT
- Title: Batch Clustering for Multilingual News Streaming
- Authors: Mathis Linger and Mhamed Hajaiej
- Abstract summary: The large volume of diverse and unorganized information makes reading difficult or almost impossible.
We process articles per batch, looking for monolingual local topics which are then linked across time and languages.
Our system achieves state-of-the-art monolingual results on a dataset of Spanish and German news and state-of-the-art cross-lingual results on English, Spanish, and German news.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, digital news articles are widely available, published by various
editors and often written in different languages. This large volume of diverse
and unorganized information makes human reading very difficult or almost
impossible. This leads to a need for algorithms able to arrange large amounts
of multilingual news into stories. To this purpose, we extend previous work on
Topic Detection and Tracking, and propose a new system inspired by newsLens.
We process articles per batch, looking for monolingual local topics which are
then linked across time and languages. Here, we introduce a novel "replaying"
strategy to link monolingual local topics into stories. In addition, we propose
new fine-tuned multilingual embeddings using SBERT to create cross-lingual
stories. Our system achieves state-of-the-art monolingual results on a dataset
of Spanish and German news and state-of-the-art cross-lingual results on
English, Spanish, and German news.
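Below is a minimal sketch of the batch pipeline the abstract describes, not the authors' implementation: a public multilingual SBERT checkpoint stands in for the paper's fine-tuned embeddings, the clustering and linking thresholds are illustrative, and the paper's "replaying" strategy is simplified here to a greedy centroid match against existing stories.

```python
# Sketch only (assumptions: model name, thresholds, greedy story linking).
# Requires sentence-transformers and scikit-learn >= 1.2.
from collections import defaultdict

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

# Public multilingual SBERT model; the paper fine-tunes its own embeddings.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

stories = []  # each story: {"centroid": unit vector, "n": article count}

def process_batch(articles, cluster_threshold=0.8, link_threshold=0.6):
    """Cluster one batch per language, then link local topics to stories."""
    by_lang = defaultdict(list)
    for art in articles:  # articles: [{"lang": ..., "text": ...}, ...]
        by_lang[art["lang"]].append(art)

    for lang, batch in by_lang.items():
        emb = model.encode([a["text"] for a in batch], normalize_embeddings=True)
        if len(batch) == 1:
            labels = np.zeros(1, dtype=int)
        else:
            # Monolingual local topics within this batch.
            labels = AgglomerativeClustering(
                n_clusters=None, distance_threshold=cluster_threshold,
                metric="cosine", linkage="average").fit_predict(emb)
        for label in np.unique(labels):
            centroid = emb[labels == label].mean(axis=0)
            centroid /= np.linalg.norm(centroid)
            # Link the local topic to the closest story, or open a new one.
            sims = [float(centroid @ s["centroid"]) for s in stories]
            if sims and max(sims) >= link_threshold:
                s = stories[int(np.argmax(sims))]
                s["centroid"] = (s["n"] * s["centroid"] + centroid) / (s["n"] + 1)
                s["centroid"] /= np.linalg.norm(s["centroid"])
                s["n"] += 1
            else:
                stories.append({"centroid": centroid, "n": 1})
```

Because the embeddings are multilingual, local topics from different languages can land in the same story, which is what makes the linking step cross-lingual rather than per-language.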
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual ALT system with available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z)
- Cross-Lingual Transfer for Natural Language Inference via Multilingual Prompt Translator [104.63314132355221]
Cross-lingual transfer with prompt learning has shown promising results.
We propose a novel framework, Multilingual Prompt Translator (MPT).
MPT outperforms vanilla prompting when transferring to languages quite distinct from the source language.
arXiv Detail & Related papers (2024-03-19T03:35:18Z)
- Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages [8.858671209228536]
We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, Nepali into English.
We find that transliteration does not give pronounced improvements.
Our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences.
arXiv Detail & Related papers (2023-05-04T23:35:15Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Simplifying Multilingual News Clustering Through Projection From a Shared Space [0.39560040546164016]
The task of organizing and clustering multilingual news articles for media monitoring is essential to follow news stories in real time.
Most approaches to this task focus on high-resource languages (mostly English), with low-resource languages being disregarded.
We present a much simpler online system that is able to cluster an incoming stream of documents without depending on language-specific features.
arXiv Detail & Related papers (2022-04-28T11:32:49Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The method consists of replacing the shared vocabulary with a small language-specific vocabulary and fine-tuning the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade (a minimal sketch of this idea follows after this list).
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
- Transferring Knowledge Distillation for Multilingual Social Event Detection [42.663309895263666]
Recently published graph neural networks (GNNs) show promising performance at social event detection tasks.
We present a GNN that incorporates cross-lingual word embeddings for detecting events in multilingual data streams.
Experiments on both synthetic and real-world datasets show the framework to be highly effective at detection both in multilingual data and in languages where training samples are scarce.
arXiv Detail & Related papers (2021-08-06T12:38:42Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Upgrading the Newsroom: An Automated Image Selection System for News Articles [6.901494425127736]
We propose an automated image selection system to assist photo editors in selecting suitable images for news articles.
The system fuses multiple textual sources extracted from news articles and accepts multilingual inputs.
We extensively experiment with our system on a large-scale text-image database containing multimodal multilingual news articles.
arXiv Detail & Related papers (2020-04-23T20:29:26Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
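The "Continual Learning in Multilingual NMT" entry above describes one concrete mechanism: keep the trained model frozen and learn only a small language-specific embedding table. Below is a minimal sketch of that idea under assumed dimensions; nn.Transformer stands in for the trained NMT model, and the toy loss is illustrative only.

```python
# Sketch of the language-specific-embedding idea (sizes and loss are toys).
import torch
import torch.nn as nn

d_model, new_vocab = 512, 8000                 # assumed dimensions

base_model = nn.Transformer(d_model=d_model)   # stand-in for the trained NMT model
for p in base_model.parameters():
    p.requires_grad = False                    # original parameters stay intact

new_embed = nn.Embedding(new_vocab, d_model)   # small language-specific vocabulary
optimizer = torch.optim.Adam(new_embed.parameters(), lr=1e-4)

# One training step: gradients flow through the frozen model, but only
# new_embed is updated, so performance on the initial languages is preserved.
src_ids = torch.randint(0, new_vocab, (16, 32))   # toy batch of source token ids
tgt = torch.randn(16, 32, d_model)                # toy target-side representation
loss = base_model(new_embed(src_ids), tgt).sum()  # placeholder for a real NMT loss
loss.backward()
optimizer.step()
```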