Computational analyses of the topics, sentiments, literariness,
creativity and beauty of texts in a large Corpus of English Literature
- URL: http://arxiv.org/abs/2201.04356v1
- Date: Wed, 12 Jan 2022 08:16:52 GMT
- Authors: Arthur M. Jacobs and Annette Kinder
- Abstract summary: The Gutenberg Literary English Corpus (GLEC) provides a rich source of textual data for research in digital humanities, computational linguistics or neurocognitive poetics.
We report the results of three studies providing i) topic and sentiment analyses for six text categories of GLEC and its >100 authors, ii) novel measures of semantic complexity as indices of the literariness, creativity and book beauty of the works in GLEC, and iii) two experiments on text classification and authorship recognition using novel features of semantic complexity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Gutenberg Literary English Corpus (GLEC, Jacobs, 2018a) provides a rich
source of textual data for research in digital humanities, computational
linguistics or neurocognitive poetics. In this study we address differences
among the literature categories in GLEC, as well as differences
between authors. We report the results of three studies providing i) topic and
sentiment analyses for six text categories of GLEC (i.e., children and youth,
essays, novels, plays, poems, stories) and its >100 authors, ii) novel measures
of semantic complexity as indices of the literariness, creativity and book
beauty of the works in GLEC (e.g., Jane Austen's six novels), and iii) two
experiments on text classification and authorship recognition using novel
features of semantic complexity. Two novel measures estimating a text's
literariness, namely intratextual variance and stepwise distance (van
Cranenburgh et al., 2019), indicated that plays are the most literary texts in
GLEC, followed by poems and novels. Computation of a novel index of text
creativity (Gray et al., 2016) revealed poems and plays as the most creative
categories with the most creative authors all being poets (Milton, Pope, Keats,
Byron, or Wordsworth). We also computed a novel index of perceived beauty of
verbal art (Kintsch, 2012) for the works in GLEC and predict that Emma is the
theoretically most beautiful of Austen's novels. Finally, we demonstrate that
these novel measures of semantic complexity are important features for text
classification and authorship recognition with overall predictive accuracies in
the range of .75 to .97. Our data pave the way for future computational and
empirical studies of literature or experiments in reading psychology and offer
multiple baselines and benchmarks for analysing and validating other book
corpora.
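
The two literariness measures named in the abstract can be illustrated with a minimal sketch. It assumes a text has already been split into consecutive chunks and each chunk mapped to a numeric vector. Intratextual variance is taken here as the mean distance of the chunk vectors to their centroid, and stepwise distance as the mean distance between consecutive chunk vectors. The function names, the Euclidean metric, and the toy vectors are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def intratextual_variance(chunks: Sequence[Vector]) -> float:
    """Mean distance of each chunk vector to the centroid of the text."""
    center = centroid(chunks)
    return sum(euclidean(v, center) for v in chunks) / len(chunks)


def stepwise_distance(chunks: Sequence[Vector]) -> float:
    """Mean distance between consecutive chunk vectors (needs >= 2 chunks)."""
    steps = [euclidean(a, b) for a, b in zip(chunks, chunks[1:])]
    return sum(steps) / len(steps)


# Hypothetical 2-d chunk vectors; real chunk embeddings would be
# higher-dimensional and derived from the text itself.
toy_chunks = [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]]
variance_score = intratextual_variance(toy_chunks)
stepwise_score = stepwise_distance(toy_chunks)
```

Under this reading, a text whose chunks drift far from its overall semantic centre scores high on intratextual variance, while a text that changes sharply from one chunk to the next scores high on stepwise distance.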
Related papers
- Latent Structures of Intertextuality in French Fiction [0.0]
This paper argues that the field of computational literary studies is the ideal place to conduct a study of intertextuality.
We present a study of a corpus of more than 12,000 French works of fiction from the 18th, 19th and early 20th centuries.
arXiv Detail & Related papers (2024-10-23T10:50:40Z)
- Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts [49.97673761305336]
We evaluate three large language models (LLMs) for their alignment with human narrative styles and potential gender biases.
Our findings indicate that, while these models generally produce text closely resembling human authored content, variations in stylistic features suggest significant gender biases.
arXiv Detail & Related papers (2024-06-27T19:26:11Z)
- LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
- An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship Attribution [74.27826764855911]
We employ syllabic quantity as a base for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts.
Our experiments, carried out on three different datasets, using two different machine learning methods, show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
arXiv Detail & Related papers (2021-10-27T06:25:31Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Modeling Social Readers: Novel Tools for Addressing Reception from Online Book Reviews [0.0]
We study the readers' distillation of the main storylines in a novel using a corpus of reviews of five popular novels.
We make three important contributions to the study of infinite vocabulary networks.
We present a new sequencing algorithm, REV2SEQ, that generates a consensus sequence of events based on partial trajectories aggregated from the reviews.
arXiv Detail & Related papers (2021-05-03T20:10:14Z)
- Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set [0.0]
We show that in the entire GLEC quasi error-free text classification and authorship recognition is possible with a method using the same set of five style and five content features.
Our data pave the way for many future computational and empirical studies of literature or experiments in reading psychology.
arXiv Detail & Related papers (2020-10-21T07:39:55Z)
- A Comparative Study of Feature Types for Age-Based Text Classification [3.867363075280544]
We compare the effectiveness of various types of linguistic features for the task of age-based classification of fiction texts.
The results obtained show that the features describing the text at the document level can significantly increase the quality of machine learning models.
arXiv Detail & Related papers (2020-09-24T18:41:10Z)
- Comparative Computational Analysis of Global Structure in Canonical, Non-Canonical and Non-Literary Texts [0.0]
Three text types (non-literary, literary/canonical and literary/non-canonical) exhibit systematic differences with respect to structural design features as correlates of aesthetic responses in readers.
Two aspects of global structure are investigated, variability and self-similar (fractal) patterns, which reflect long-range correlations along texts.
Our results show that low-level properties of texts are better discriminators than high-level properties, for the three text types under analysis.
arXiv Detail & Related papers (2020-08-25T09:37:06Z)