Related papers: A diverse Multilingual News Headlines Dataset from around the World

URL: http://arxiv.org/abs/2403.19352v1
Date: Thu, 28 Mar 2024 12:08:39 GMT
Title: A diverse Multilingual News Headlines Dataset from around the World
Authors: Felix Leeb, Bernhard Schölkopf
Abstract summary: Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide. It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
Score: 57.37355895609648
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide with English translations of all articles included. Designed for natural language processing and media studies, it serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles, for example, to analyze global news coverage and cultural narratives. As a simple demonstration of the analyses facilitated by this dataset, we use a basic procedure using a TF-IDF weighted similarity metric to group articles into clusters about the same event. We then visualize the "event signatures" of the event showing articles of which languages appear over time, revealing intuitive features based on the proximity of the event and unexpectedness of the event. The dataset is available on Kaggle (https://www.kaggle.com/datasets/felixludos/babel-briefings) and HuggingFace (https://huggingface.co/datasets/felixludos/babel-briefings) with accompanying GitHub code (https://github.com/felixludos/babel-briefings).
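The abstract mentions grouping articles into event clusters with a TF-IDF weighted similarity metric but gives no details here. The following is only a minimal sketch of that general idea, not the authors' procedure: it builds TF-IDF vectors from whitespace-tokenized headlines, compares them with cosine similarity, and links headlines whose similarity meets a cutoff. The function names, the smoothed IDF formula, the single-link grouping, and the `threshold=0.3` value are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(headlines):
    """Return one sparse TF-IDF vector (dict: term -> weight) per headline."""
    docs = [Counter(h.lower().split()) for h in headlines]
    n_docs = len(docs)
    # Document frequency: number of headlines that contain each term.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
        vectors.append({
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, tf in doc.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_headlines(headlines, threshold=0.3):
    """Group headline indices by single-link clustering.

    Two headlines end up in the same cluster if they are connected by a
    chain of pairs whose cosine similarity is at least `threshold`
    (a hypothetical cutoff, not a value taken from the paper).
    """
    vectors = tfidf_vectors(headlines)
    parent = list(range(len(headlines)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(headlines)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    sample = [
        "earthquake strikes coastal city",
        "strong earthquake strikes city on the coast",
        "central bank raises interest rates",
        "bank raises rates again",
    ]
    for group in cluster_headlines(sample):
        print([sample[i] for i in group])
```

Since the dataset includes English translations of all articles, a sketch like this could in principle be run on the translated headlines so that articles in different languages share a vocabulary. The pairwise comparison is quadratic in the number of headlines, so it would need batching or an index for anything near the full 4.7 million entries.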

Related papers

CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English [53.32175252285023]
Cross-lingual news comparison offers a promising approach to verify information. Existing datasets for cross-lingual news analysis were manually curated by journalists and experts. We introduce a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment.
arXiv Detail & Related papers (2025-10-22T14:23:50Z)
A Dataset for Analysing News Framing in Chinese Media [0.847791472364259]
This study introduces the first Chinese News Framing dataset, to be used as either a stand-alone dataset or a supplementary resource to the SemEval-2023 task 3 dataset. We detail its creation and we run baseline experiments to highlight the need for such a dataset and create benchmarks for future research. For the Chinese language, we obtain an F1-micro (the performance metric for SemEval task 3, subtask 2) score of 0.719 using only samples from our Chinese News Framing dataset and a score of 0.753 when we augment the SemEval dataset with Chinese news framing samples.
arXiv Detail & Related papers (2025-03-06T13:55:33Z)
The 2021 Tokyo Olympics Multilingual News Article Dataset [0.9749638953163389]
A total of 10,940 news articles were gathered from 1,918 different publishers covering 1,350 sub-events of the 2021 Olympics. These articles are written in nine languages from different language families and in different scripts. The development of this dataset aims to provide a resource for evaluating the performance of multilingual news clustering algorithms.
arXiv Detail & Related papers (2025-02-10T16:38:03Z)
A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model [15.596156608713347]
In real-world scenarios, news about an international event often involves multiple documents in different languages. We construct a mixed-language multi-document news summarization dataset (MLMD-news). This dataset contains four different languages and 10,992 source document cluster and target summary pairs.
arXiv Detail & Related papers (2024-10-13T08:15:33Z)
Automatic Data Retrieval for Cross Lingual Summarization [4.759360739268894]
Cross-lingual summarization involves the summarization of text written in one language to a different one. In this work, we aim to perform cross-lingual summarization from English to Hindi.
arXiv Detail & Related papers (2023-12-22T09:13:24Z)
$\mu$PLAN: Summarizing using a Content Plan as Cross-Lingual Bridge [72.64847925450368]
Cross-lingual summarization consists of generating a summary in one language given an input document in a different language. This work presents $\mu$PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge.
arXiv Detail & Related papers (2023-05-23T16:25:21Z)
Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. We design a simple but effective ensemble-based framework that combines various transfer learning techniques. We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks. This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query. We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages. We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model. We conduct multilingual zero-shot summarization experiments on the MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language. The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German. We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs [27.574815708395203]
CrossSum is a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs. We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset. We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language.
arXiv Detail & Related papers (2021-12-16T11:40:36Z)
MLSUM: The Multilingual Summarization Corpus [29.943949944682196]
MLSUM is the first large-scale MultiLingual SUMmarization dataset. It contains 1.5M+ article/summary pairs in five different languages.
arXiv Detail & Related papers (2020-04-30T15:58:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.