Echoes from Alexandria: A Large Resource for Multilingual Book
Summarization
- URL: http://arxiv.org/abs/2306.04334v1
- Date: Wed, 7 Jun 2023 11:01:39 GMT
- Title: Echoes from Alexandria: A Large Resource for Multilingual Book
Summarization
- Authors: Alessandro Scirè, Simone Conia, Simone Ciciliano, Roberto Navigli
- Abstract summary: "Echoes from Alexandria" is a large resource for multilingual book summarization.
Echoes features three novel datasets: i) Echo-Wiki, for multilingual book summarization, ii) Echo-XSum, for extremely-compressive multilingual book summarization, and iii) Echo-FairySum, for extractive book summarization.
- Score: 99.86355187131349
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, research in text summarization has mainly focused on the
news domain, where texts are typically short and have strong layout features.
The task of full-book summarization presents additional challenges which are
hard to tackle with current resources, due to their limited size and
availability in English only. To overcome these limitations, we present "Echoes
from Alexandria", or in shortened form, "Echoes", a large resource for
multilingual book summarization. Echoes features three novel datasets: i)
Echo-Wiki, for multilingual book summarization, ii) Echo-XSum, for
extremely-compressive multilingual book summarization, and iii) Echo-FairySum,
for extractive book summarization. To the best of our knowledge, Echoes, with
its thousands of books and summaries, is the largest resource, and the first to
be multilingual, featuring 5 languages and 25 language pairs. In addition to
Echoes, we also introduce a new extractive-then-abstractive baseline, and,
supported by our experimental results and manual analysis of the summaries
generated, we argue that this baseline is more suitable for book summarization
than purely-abstractive approaches. We release our resource and software at
https://github.com/Babelscape/echoes-from-alexandria in the hope of fostering
innovative research in multilingual book summarization.
Related papers
- Converging Dimensions: Information Extraction and Summarization through Multisource, Multimodal, and Multilingual Fusion [0.0]
The paper proposes a novel approach to summarization that tackles such challenges by utilizing the strength of multiple sources.
The research progresses beyond conventional, unimodal sources such as text documents and integrates a more diverse range of data, including YouTube playlists, pre-prints, and Wikipedia pages.
arXiv Detail & Related papers (2024-06-19T17:15:47Z)
- Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
- XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages [11.581072296148031]
Existing work on Wikipedia text generation has focused on English only, where English reference articles are summarized to generate English Wikipedia pages.
We propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text.
arXiv Detail & Related papers (2023-03-22T04:52:43Z)
- LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization [19.301567079372436]
Text Summarization is a popular task and an active area of research for the Natural Language Processing community.
All publicly available summarization datasets only provide plain text content.
We present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information.
arXiv Detail & Related papers (2023-01-26T18:50:54Z)
- Recitation-Augmented Language Models [85.30591349383849]
We show that RECITE is a powerful paradigm for knowledge-intensive NLP tasks.
Specifically, we show that by utilizing recitation as the intermediate step, a recite-and-answer scheme can achieve new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-04T00:49:20Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Klexikon: A German Dataset for Joint Summarization and Simplification [2.931632009516441]
We create a new dataset for joint Text Simplification and Summarization based on German Wikipedia and the German children's lexicon "Klexikon".
We highlight the summarization aspect and provide statistical evidence that this resource is well suited to simplification as well.
arXiv Detail & Related papers (2022-01-18T18:50:43Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.