A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model
- URL: http://arxiv.org/abs/2410.09773v1
- Date: Sun, 13 Oct 2024 08:15:33 GMT
- Title: A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model
- Authors: Shengxiang Gao, Fang Nan, Yongbing Zhang, Yuxin Huang, Kaiwen Tan, Zhengtao Yu
- Abstract summary: In real-world scenarios, news about an international event often involves multiple documents in different languages.
We construct a mixed-language multi-document news summarization dataset (MLMD-news)
This dataset covers four different languages and contains 10,992 pairs of source document clusters and target summaries.
- Score: 15.596156608713347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD) or cross-language single-document (CLSD) settings. However, in real-world scenarios, news about an international event often involves multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Therefore, summarizing MLMD news is of great significance. However, the lack of datasets for MLMD news summarization has constrained the development of research in this area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which contains four different languages and 10,992 pairs of source document clusters and target summaries. Additionally, we propose a graph-based extract-generate model, benchmark various methods on the MLMD-news dataset, and publicly release our dataset and code (https://github.com/Southnf9/MLMD-news), aiming to advance research in summarization within MLMD scenarios.
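The abstract does not describe the internals of the proposed graph-based extract-generate model, so the following is only a rough, hypothetical sketch of the *extract* stage of such a pipeline: sentences from a document cluster are ranked by PageRank over a sentence-similarity graph (TextRank-style), and the top-ranked sentences would then be passed to an abstractive generator. All function names here are illustrative, not from the paper, and a real MLMD system would use multilingual sentence embeddings rather than bag-of-words overlap.

```python
from collections import Counter
import math

def sentence_vector(sentence):
    # Bag-of-words term-frequency vector for one sentence.
    return Counter(sentence.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_salient(sentences, k=2, damping=0.85, iters=50):
    # Rank sentences by PageRank over a similarity graph and return
    # the top-k (in document order) as the extract-stage output.
    vecs = [sentence_vector(s) for s in sentences]
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                total = sum(sim[j])
                if sim[j][i] and total:
                    rank += scores[j] * sim[j][i] / total
            new.append((1 - damping) / n + damping * rank)
        scores = new
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

sents = [
    "the flood hit the city hard",
    "rescue teams reached the city after the flood",
    "stocks rose on tuesday",
]
# The two topically related sentences reinforce each other in the
# graph, so they are extracted ahead of the unrelated one.
salient = extract_salient(sents, k=2)
```

In the generate stage, the extracted sentences (potentially drawn from documents in different languages) would be concatenated and fed to a multilingual abstractive model to produce the target-language summary.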
Related papers
- A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z) - Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles [136.84278943588652]
We propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event.
To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm.
The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference.
arXiv Detail & Related papers (2023-09-17T20:28:17Z) - MT4CrossOIE: Multi-stage Tuning for Cross-lingual Open Information Extraction [38.88339164947934]
Cross-lingual open information extraction aims to extract structured information from raw text across multiple languages.
Previous work uses a shared cross-lingual pre-trained model to handle the different languages but underuses the potential of the language-specific representation.
We propose an effective multi-stage tuning framework called MT4CrossIE, designed for enhancing cross-lingual open information extraction.
arXiv Detail & Related papers (2023-08-12T12:38:10Z) - MUTANT: A Multi-sentential Code-mixed Hinglish Dataset [16.14337612590717]
We propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles.
As a use case, we leverage multilingual articles and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset.
The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs.
arXiv Detail & Related papers (2023-02-23T04:04:18Z) - Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z) - EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain [2.4815579733050157]
We propose a novel dataset, called EUR-Lex-Sum, based on manually curated document summaries of legal acts from the European Union law platform (EUR-Lex).
Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages.
We obtain up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
arXiv Detail & Related papers (2022-10-24T17:58:59Z) - WikiMulti: a Corpus for Cross-Lingual Summarization [5.566656105144887]
Cross-lingual summarization is the task of producing a summary in one language for a source document in a different language.
We introduce WikiMulti - a new dataset for cross-lingual summarization based on Wikipedia articles in 15 languages.
arXiv Detail & Related papers (2022-04-23T16:47:48Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - MLSUM: The Multilingual Summarization Corpus [29.943949944682196]
MLSUM is the first large-scale MultiLingual SUMmarization dataset.
It contains 1.5M+ article/summary pairs in five different languages.
arXiv Detail & Related papers (2020-04-30T15:58:34Z) - Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.