A data science and machine learning approach to continuous analysis of
Shakespeare's plays
- URL: http://arxiv.org/abs/2301.06024v3
- Date: Tue, 11 Jul 2023 14:20:54 GMT
- Title: A data science and machine learning approach to continuous analysis of
Shakespeare's plays
- Authors: Charles Swisher, Lior Shamir
- Abstract summary: We apply machine learning analysis to the work of William Shakespeare.
The analysis shows clear changes in the style of writing over time.
Applying machine learning to make a stylometric prediction of the year of the play shows a Pearson correlation of 0.71.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The availability of quantitative text analysis methods has provided new ways
of analyzing literature in a manner that was not available in the
pre-information era. Here we apply comprehensive machine learning analysis to
the work of William Shakespeare. The analysis shows clear changes in the style
of writing over time, with the most significant changes in the sentence length,
frequency of adjectives and adverbs, and the sentiments expressed in the text.
Applying machine learning to make a stylometric prediction of the year of the
play shows a Pearson correlation of 0.71 between the actual and predicted year,
indicating that Shakespeare's writing style as reflected by the quantitative
measurements changed over time. Additionally, the analysis shows that the stylometrics of
some plays are more similar to those of plays written before or after the
year they were written. For instance, Romeo and Juliet is dated 1596, but is
more similar in stylometrics to plays written by Shakespeare after 1600. The
source code for the analysis is available for free download.
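The reported figure of 0.71 is a Pearson correlation between each play's actual year and the year predicted from its stylometric features. As a minimal sketch of how such a coefficient is computed, the block below implements the standard formula in plain Python. The function name `pearson` and both lists of years are hypothetical placeholders chosen for illustration; they are not the authors' code or data, and the printed value has no connection to the paper's result.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only; NOT results from the paper.
actual_years = [1592, 1595, 1598, 1601, 1605, 1610]
predicted_years = [1596, 1594, 1600, 1599, 1604, 1607]

r = pearson(actual_years, predicted_years)
print(f"Pearson r = {r:.2f}")
```

A coefficient close to 1 would mean the predicted years rise almost linearly with the actual years, while a value near 0 would mean the stylometric features carry little information about chronology.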
Related papers
- You Shall Know a Tool by the Traces it Leaves: The Predictability of Sentiment Analysis Tools [74.98850427240464]
We show that sentiment analysis tools disagree on the same dataset.
We show that the sentiment tool used for sentiment annotation can even be predicted from its outcome.
arXiv Detail & Related papers (2024-10-18T17:27:38Z)
- LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
- A Comparison of Lexicon-Based and ML-Based Sentiment Analysis: Are There Outlier Words? [14.816706893177997]
In this paper we compute sentiment for more than 150,000 English language texts drawn from 4 domains.
We model differences in sentiment scores between approaches for documents in each domain using a regression.
Our findings are that the importance of a word depends on the domain and there are no standout lexical entries which systematically cause differences in sentiment scores.
arXiv Detail & Related papers (2023-11-10T18:21:50Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- A decomposition of book structure through ousiometric fluctuations in cumulative word-time [1.181206257787103]
We look at how words change over the course of a book as a function of the number of words, rather than the fraction of the book.
We find that shorter books exhibit only a general trend, while longer books have fluctuations in addition to the general trend.
Our findings suggest that, in the ousiometric sense, longer books are not expanded versions of shorter books, but are more similar in structure to a concatenation of shorter texts.
arXiv Detail & Related papers (2022-08-19T18:17:27Z)
- Semantic Analysis for Automated Evaluation of the Potential Impact of Research Articles [62.997667081978825]
This paper presents a novel method for vector representation of text meaning based on information theory.
We show how this informational semantics is used for text classification on the basis of the Leicester Scientific Corpus.
We show that an informational approach to representing the meaning of a text has offered a way to effectively predict the scientific impact of research papers.
arXiv Detail & Related papers (2021-04-26T20:37:13Z)
- Multiple regression techniques for modeling dates of first performances of Shakespeare-era plays [2.1827922098806214]
We took a set of Shakespeare-era plays (181 plays from the period 1585-1610) and added the best-guess dates for them from a standard reference work as metadata.
We applied 11 regression methods to predict the dates of the plays at an 80/20 training/test split.
An in-depth analysis of the most commonly occurring 20 words in the models in 100 independent runs helps explain the trends in linguistic and stylistic terms.
arXiv Detail & Related papers (2021-04-13T04:13:53Z)
- My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that, since the models are not semantically grounded with world knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)
- Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts [0.15833270109954134]
A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content.
We introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts.
We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences.
arXiv Detail & Related papers (2020-08-05T17:27:11Z)
- Quality of Word Embeddings on Sentiment Analysis Tasks [0.0]
We compare the performance of a dozen pretrained word embedding models on lyrics sentiment analysis and movie review polarity tasks.
According to our results, Twitter Tweets is the best on lyrics sentiment analysis, whereas Google News and Common Crawl are the top performers on movie polarity analysis.
arXiv Detail & Related papers (2020-03-06T15:03:08Z)
- Learning Dynamic Belief Graphs to Generalize on Text-Based Games [55.59741414135887]
Playing text-based games requires skills in processing natural language and sequential decision making.
In this work, we investigate how an agent can plan and generalize in text-based games using graph-structured representations learned end-to-end from raw text.
arXiv Detail & Related papers (2020-02-21T04:38:37Z)