Related papers: BookWorm: A Dataset for Character Description and Analysis

URL: http://arxiv.org/abs/2410.10372v1
Date: Mon, 14 Oct 2024 10:55:58 GMT
Title: BookWorm: A Dataset for Character Description and Analysis
Authors: Argyrios Papoudakis, Mirella Lapata, Frank Keller
Abstract summary: We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses. Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
Score: 59.186325346763184
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore the understanding of characters in full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation, including character development, personality, and social context. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, utilizing both retrieval-based and hierarchical processing for book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative understanding.

Related papers

Show, Don't Tell: Uncovering Implicit Character Portrayal using LLMs [19.829683714192615]
We introduce LIIPA, a framework for prompting large language models to uncover implicit character portrayals. We find that LIIPA outperforms existing approaches, and is more robust to increasing character counts. Our work demonstrates the potential benefits of using LLMs to analyze complex characters.
arXiv Detail & Related papers (2024-12-05T19:46:53Z)
CHATTER: A Character Attribution Dataset for Narrative Understanding [31.540540919042154]
We validate a subset of Chatter, called ChatterEval, using human annotations to serve as an evaluation benchmark for the character attribution task in movie scripts. ChatterEval assesses narrative understanding and the long-context modeling capacity of language models.
arXiv Detail & Related papers (2024-11-07T22:37:30Z)
Generating Visual Stories with Grounded and Coreferent Characters [63.07511918366848]
We present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions. Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark. We also propose new evaluation metrics to measure the richness of characters and coreference in stories.
arXiv Detail & Related papers (2024-09-20T14:56:33Z)
CHIRON: Rich Character Representations in Long-Form Narratives [98.273323001781]
We propose CHIRON, a new 'character sheet' based representation that organizes and filters textual information about characters. We validate CHIRON via the downstream task of masked-character prediction, where our experiments show CHIRON is better and more flexible than comparable summary-based baselines. We also show that metrics derived from CHIRON can be used to automatically infer character-centricity in stories, and that these metrics align with human judgments.
arXiv Detail & Related papers (2024-06-14T17:23:57Z)
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
Branching Narratives: Character Decision Points Detection [13.615681132633561]
We propose a novel dataset based on CYOA-like game graphs to be used as a benchmark for such a task. We show how such a model can be applied to the existing text to produce linear segments divided by potential branching points.
arXiv Detail & Related papers (2024-05-12T13:36:07Z)
Personality Understanding of Fictional Characters during Book Reading [81.68515671674301]
We present the first labeled dataset PersoNet for this problem. Our novel annotation strategy involves annotating user notes from online reading apps as a proxy for the original books. Experiments and human studies indicate that our dataset construction is both efficient and accurate.
arXiv Detail & Related papers (2023-05-17T12:19:11Z)
Detecting and Grounding Important Characters in Visual Stories [18.870236356616907]
We introduce the VIST-Character dataset, which provides rich character-centric annotations. Based on this dataset, we propose two new tasks: important character detection and character grounding in visual stories. We develop simple, unsupervised models based on distributional similarity and pre-trained vision-and-language models.
arXiv Detail & Related papers (2023-03-30T18:24:06Z)
PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
"Let Your Characters Tell Their Story": A Dataset for Character-Centric Narrative Understanding [31.803481510886378]
We present LiSCU -- a new dataset of literary pieces and their summaries paired with descriptions of characters that appear in them. We also introduce two new tasks on LiSCU: Character Identification and Character Description Generation. Our experiments with several pre-trained language models adapted for these tasks demonstrate that there is a need for better models of narrative comprehension.
arXiv Detail & Related papers (2021-09-12T06:12:55Z)
ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets. The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.