Klexikon: A German Dataset for Joint Summarization and Simplification
- URL: http://arxiv.org/abs/2201.07198v1
- Date: Tue, 18 Jan 2022 18:50:43 GMT
- Title: Klexikon: A German Dataset for Joint Summarization and Simplification
- Authors: Dennis Aumiller and Michael Gertz
- Abstract summary: We create a new dataset for joint Text Simplification and Summarization based on German Wikipedia and the German children's lexicon "Klexikon".
We highlight the summarization aspect and provide statistical evidence that this resource is well suited to simplification as well.
- Score: 2.931632009516441
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditionally, Text Simplification is treated as a monolingual translation
task where sentences between source texts and their simplified counterparts are
aligned for training. However, especially for longer input documents,
summarizing the text (or dropping less relevant content altogether) plays an
important role in the simplification process, which is currently not reflected
in existing datasets. Simultaneously, resources for non-English languages are
scarce in general and prohibitive for training new solutions. To tackle this
problem, we pose core requirements for a system that can jointly summarize and
simplify long source documents. We further describe the creation of a new
dataset for joint Text Simplification and Summarization based on German
Wikipedia and the German children's lexicon "Klexikon", consisting of almost
2900 documents. We release a document-aligned version that particularly
highlights the summarization aspect, and provide statistical evidence that this
resource is well suited to simplification as well. Code and data are available
on Github: https://github.com/dennlinger/klexikon
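Since the data is released through the linked repository, a minimal loading sketch may be useful. It assumes the dataset is also mirrored on the Hugging Face Hub under "dennlinger/klexikon"; the field names are printed rather than assumed.

```python
# Minimal sketch for loading the document-aligned Klexikon data.
# Assumption: the dataset is mirrored on the Hugging Face Hub as
# "dennlinger/klexikon"; otherwise the JSON files from the GitHub
# repository can be loaded via load_dataset("json", data_files=...).
from datasets import load_dataset

klexikon = load_dataset("dennlinger/klexikon")

# Each example pairs a long Wikipedia article with its shorter,
# simpler Klexikon counterpart; inspect the schema before use.
print(klexikon["train"].column_names)
print(klexikon["train"][0])
```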
Related papers
- Summarization-based Data Augmentation for Document Classification [16.49709049899731]
We propose a simple yet effective summarization-based data augmentation, SUMMaug, for document classification.
We first obtain easy-to-learn examples for the target document classification task.
We then use the generated pseudo examples to perform curriculum learning.
arXiv Detail & Related papers (2023-12-01T11:34:37Z)
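A rough sketch of the SUMMaug idea, not the authors' implementation: summaries of training documents serve as easy-to-learn pseudo examples that are consumed before the full documents, giving a simple curriculum. The summarizer checkpoint below is an arbitrary choice.

```python
# Hedged sketch of summarization-based augmentation: each training
# document is compressed into an "easy" pseudo example that keeps
# the original label; training then starts from the pseudo examples
# (curriculum) before moving on to the full documents.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def make_pseudo_examples(docs, labels):
    pseudo = []
    for doc, label in zip(docs, labels):
        summary = summarizer(doc, max_length=128, truncation=True)[0]["summary_text"]
        pseudo.append((summary, label))
    return pseudo

# Curriculum order: easy pseudo examples first, then the originals.
# train_set = make_pseudo_examples(docs, labels) + list(zip(docs, labels))
```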
- On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z)
- MCTS: A Multi-Reference Chinese Text Simplification Dataset [15.080614581458091]
Chinese text simplification has long received very little research attention.
We introduce MCTS, a multi-reference Chinese text simplification dataset.
We evaluate the performance of several unsupervised methods and advanced large language models.
arXiv Detail & Related papers (2023-06-05T11:46:36Z)
- DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification [1.5223905439199599]
This paper presents DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German.
We show that using DEplain to train a transformer-based seq2seq text simplification model can achieve promising results.
We make the corpus, the alignment methods adapted for German, the web harvester, and the trained models publicly available.
arXiv Detail & Related papers (2023-05-30T11:07:46Z)
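As an illustration of the kind of seq2seq fine-tuning DEplain is intended for (not the authors' released code; the checkpoint and the column names "complex"/"plain" are assumptions):

```python
# Illustrative sketch: fine-tune a multilingual seq2seq model on
# aligned complex -> plain German pairs. Model checkpoint and column
# names are assumptions, not taken from the paper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def preprocess(batch):
    inputs = tokenizer(batch["complex"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["plain"], max_length=512, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# The tokenized pairs can then be passed to Seq2SeqTrainer (or a
# custom loop) for standard cross-entropy fine-tuning.
```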
- Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents [13.755637074366813]
Summ^N is a simple, flexible, and effective multi-stage framework for input texts longer than the maximum context lengths of typical pretrained LMs.
It can process input text of arbitrary length by adjusting the number of stages while keeping the LM context size fixed.
Our experiments demonstrate that Summ^N significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2021-10-16T06:19:54Z)
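The coarse-to-fine idea behind Summ^N can be sketched in a few lines. This is a schematic approximation using whitespace tokens, not the paper's exact procedure; `summarize` stands in for any pretrained summarization model.

```python
# Schematic sketch of multi-stage summarization: split the source into
# context-sized chunks, summarize each, join the partial summaries,
# and repeat until a single pass fits the model's context window.
def multi_stage_summarize(text, summarize, max_tokens=1024):
    while len(text.split()) > max_tokens:  # crude whitespace token count
        words = text.split()
        chunks = [" ".join(words[i:i + max_tokens])
                  for i in range(0, len(words), max_tokens)]
        text = " ".join(summarize(chunk) for chunk in chunks)
    return summarize(text)  # final stage on a context-sized input
```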
- Document-Level Text Simplification: Dataset, Criteria and Baseline [75.58761130635824]
We define and investigate a new task of document-level text simplification.
Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia.
We propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.
arXiv Detail & Related papers (2021-10-11T08:15:31Z)
- Automated News Summarization Using Transformers [4.932130498861987]
This paper presents a comparison of several transformer-based pre-trained models for text summarization.
For analysis and comparison, we use the BBC news dataset, which contains article texts along with human-generated reference summaries.
arXiv Detail & Related papers (2021-04-23T04:22:33Z)
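Such a comparison can be approximated with the Hugging Face pipeline API; the checkpoints below are common Hub models chosen for illustration, not the paper's exact list.

```python
# Sketch of comparing pretrained transformer summarizers on one
# article. Checkpoints are assumptions, not taken from the paper.
from transformers import pipeline

article = "..."  # body text of one BBC news article goes here

for name in ["facebook/bart-large-cnn", "google/pegasus-xsum", "t5-base"]:
    summarizer = pipeline("summarization", model=name)
    summary = summarizer(article, truncation=True)[0]["summary_text"]
    print(f"{name}: {summary}")
```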
- From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information [77.89755281215079]
Text summarization is the research area aiming at creating a short and condensed version of the original document.
In real-world applications, most of the data is not in a plain text format.
This paper surveys these new summarization tasks and approaches as they arise in real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
- ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper proposes a paradigm shift in the way neural extractive summarization systems are built.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
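A toy version of the matching formulation, using sentence-transformers as a stand-in for the Siamese-BERT architecture used in the paper: enumerate candidate sentence subsets and return the one whose embedding lies closest to the full document's.

```python
# Toy sketch of extractive summarization as semantic text matching:
# the candidate summary whose embedding best matches the document
# embedding is selected.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def best_extract(sentences, k=2):
    doc_emb = model.encode(" ".join(sentences))
    candidates = [" ".join(c) for c in combinations(sentences, k)]
    scores = util.cos_sim(doc_emb, model.encode(candidates))[0]
    return candidates[int(scores.argmax())]
```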
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
Specifically, the input is a set of structured records and a reference text describing a different record set.
The output is a summary that accurately describes partial content from the source records in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
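A hypothetical example of this task's input/output structure; the record fields and texts below are invented for illustration.

```python
# Invented example of the bi-aspect setup: structured records supply
# the content, while a reference text about a *different* record set
# supplies the writing style.
records = [
    {"player": "LeBron James", "points": 32, "assists": 7},
    {"player": "Kyrie Irving", "points": 20, "assists": 5},
]
reference = ("Stephen Curry led the scoring with 28 points, "
             "while Draymond Green dished out 6 assists.")
# Desired output (content from `records`, style from `reference`):
# "LeBron James led the scoring with 32 points, while Kyrie Irving
#  dished out 5 assists."
```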