The Shmoop Corpus: A Dataset of Stories with Loosely Aligned Summaries
- URL: http://arxiv.org/abs/1912.13082v2
- Date: Wed, 1 Jan 2020 16:06:48 GMT
- Title: The Shmoop Corpus: A Dataset of Stories with Loosely Aligned Summaries
- Authors: Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, Sanja Fidler
- Abstract summary: We introduce the Shmoop Corpus: a dataset of 231 stories paired with detailed multi-paragraph summaries for each individual chapter.
From the corpus, we construct a set of common NLP tasks, including Cloze-form question answering and a simplified form of abstractive summarization.
We believe that the unique structure of this corpus provides an important foothold towards making machine story comprehension more approachable.
- Score: 72.48439126769627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding stories is a challenging reading comprehension problem for
machines as it requires reading a large volume of text and following long-range
dependencies. In this paper, we introduce the Shmoop Corpus: a dataset of 231
stories that are paired with detailed multi-paragraph summaries for each
individual chapter (7,234 chapters), where the summary is chronologically
aligned with respect to the story chapter. From the corpus, we construct a set
of common NLP tasks, including Cloze-form question answering and a simplified
form of abstractive summarization, as benchmarks for reading comprehension on
stories. We then show that the chronological alignment provides a strong
supervisory signal that learning-based methods can exploit, leading to
significant improvements on these tasks. We believe that the unique structure
of this corpus provides an important foothold towards making machine story
comprehension more approachable.
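The Cloze-form question answering task mentioned in the abstract can be illustrated with a minimal sketch. A common way to build such questions is to mask a named entity in a summary sentence and ask a model to recover it from the aligned story text; the masking scheme, placeholder token, and entity list below are illustrative assumptions, not the corpus's exact construction procedure.

```python
# Illustrative sketch of Cloze-form question construction from a chapter
# summary sentence. The "@placeholder" token and the entity-masking scheme
# are assumptions for illustration, not the Shmoop Corpus's exact procedure.
import re

def make_cloze(sentence, entities):
    """Yield (question, answer) pairs by masking each known entity in turn."""
    pairs = []
    for ent in entities:
        pattern = r'\b' + re.escape(ent) + r'\b'
        if re.search(pattern, sentence):
            # Replace every occurrence of this entity with the placeholder.
            question = re.sub(pattern, '@placeholder', sentence)
            pairs.append((question, ent))
    return pairs

# Hypothetical summary sentence and entity list for demonstration.
summary = "Elizabeth refuses Mr. Collins and later visits Charlotte."
entities = ["Elizabeth", "Charlotte"]
for question, answer in make_cloze(summary, entities):
    print(question, "->", answer)
```

Each masked sentence becomes one question whose answer is the removed entity; a model reading the corresponding story chapter must supply the missing name.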
Related papers
- Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues
and Documents [13.755637074366813]
Summ^N is a simple, flexible, and effective multi-stage framework for input texts longer than the maximum context lengths of typical pretrained LMs.
It can process input text of arbitrary length by adjusting the number of stages while keeping the LM context size fixed.
Our experiments demonstrate that Summ^N significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2021-10-16T06:19:54Z)
- Narrative Incoherence Detection [76.43894977558811]
We propose the task of narrative incoherence detection as a new arena for inter-sentential semantic understanding.
Given a multi-sentence narrative, the task is to decide whether there are any semantic discrepancies in the narrative flow.
arXiv Detail & Related papers (2020-12-21T07:18:08Z)
- Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries [46.183289748907804]
We propose SOE, a pipelined system that summarizes, outlines, and elaborates for long-text generation.
SOE produces long texts with significantly better quality, along with faster convergence speed.
arXiv Detail & Related papers (2020-10-14T13:22:20Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
- Document Modeling with Graph Attention Networks for Multi-grained Machine Reading Comprehension [127.3341842928421]
Natural Questions is a new challenging machine reading comprehension benchmark.
It has two-grained answers: a long answer (typically a paragraph) and a short answer (one or more entities inside the long answer).
Existing methods treat these two sub-tasks individually during training while ignoring their dependencies.
We present a novel multi-grained machine reading comprehension framework that models documents according to their hierarchical structure.
arXiv Detail & Related papers (2020-05-12T14:20:09Z)
- Exploring Content Selection in Summarization of Novel Chapters [19.11830806780343]
We present a new summarization task, generating summaries of novel chapters using summary/chapter pairs from online study guides.
This is a harder task than the news summarization task, given the chapter length as well as the extreme paraphrasing and generalization found in the summaries.
We focus on extractive summarization, which requires the creation of a gold-standard set of extractive summaries.
arXiv Detail & Related papers (2020-05-04T20:45:39Z)
- Screenplay Summarization Using Latent Narrative Structure [78.45316339164133]
We propose to explicitly incorporate the underlying structure of narratives into general unsupervised and supervised extractive summarization models.
We formalize narrative structure in terms of key narrative events (turning points) and treat it as latent in order to summarize screenplays.
Experimental results on the CSI corpus of TV screenplays, which we augment with scene-level summarization labels, show that latent turning points correlate with important aspects of a CSI episode.
arXiv Detail & Related papers (2020-04-27T11:54:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.