From stage to page: language independent bootstrap measures of
distinctiveness in fictional speech
- URL: http://arxiv.org/abs/2301.05659v1
- Date: Fri, 13 Jan 2023 16:58:43 GMT
- Title: From stage to page: language independent bootstrap measures of
distinctiveness in fictional speech
- Authors: Artjoms Šeļa and Ben Nagy and Joanna Byszuk and Laura
Hernández-Lorenzo and Botond Szemes and Maciej Eder
- Abstract summary: We introduce and evaluate two non-parametric methods to produce a summary statistic for character distinctiveness.
We analyse 3301 characters drawn from 2324 works, covering five centuries and four languages.
Based on exploratory analysis, we find that smaller characters tend to be more distinctive, and that women are cross-linguistically more distinctive than men.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stylometry is mostly applied to authorial style. Recently, researchers have
begun investigating the style of characters, finding that the variation remains
within authorial bounds. We address the stylistic distinctiveness of characters
in drama. Our primary contribution is methodological; we introduce and evaluate
two non-parametric methods to produce a summary statistic for character
distinctiveness that can be usefully applied and compared across languages and
times. Our first method is based on bootstrap distances between 3-gram
probability distributions, the second (reminiscent of 'unmasking' techniques)
on word keyness curves. Both methods are validated and explored by applying
them to a reasonably large corpus (a subset of DraCor): we analyse 3301
characters drawn from 2324 works, covering five centuries and four languages
(French, German, Russian, and the works of Shakespeare). Both methods appear
useful; the 3-gram method is statistically more powerful but the word keyness
method offers rich interpretability. Both methods are able to capture
phonological differences such as accent or dialect, as well as broad
differences in topic and lexical richness. Based on exploratory analysis, we
find that smaller characters tend to be more distinctive, and that women are
cross-linguistically more distinctive than men, with this latter finding
carefully interrogated using multiple regression. This greater distinctiveness
stems from a historical tendency for female characters to be restricted to an
'internal narrative domain' covering mainly direct discourse and
family/romantic themes. It is hoped that direct, comparable statistical
measures will form a basis for more sophisticated future studies, and advances
in theory.
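The first method described above can be sketched in outline. The following is a minimal illustration, not the authors' implementation: the L1 distance, window size, resampling scheme, and same-source null-baseline correction are all assumptions made for the sketch.

```python
import random
from collections import Counter

def char_trigram_dist(text):
    """Character 3-gram probability distribution of a text."""
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def l1_distance(p, q):
    """L1 distance between two probability distributions."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def sample_window(text, k, rng):
    """Draw a random contiguous window of k characters."""
    start = rng.randrange(max(1, len(text) - k))
    return text[start:start + k]

def bootstrap_distinctiveness(character_text, rest_text,
                              n_iter=100, window=500, seed=0):
    """Mean bootstrapped 3-gram distance between a character's speech
    and the rest of the play, corrected by a same-source null baseline
    so that an undistinctive character scores near zero."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        p = char_trigram_dist(sample_window(character_text, window, rng))
        q1 = char_trigram_dist(sample_window(rest_text, window, rng))
        q2 = char_trigram_dist(sample_window(rest_text, window, rng))
        # Character vs. rest, minus rest vs. rest (the null expectation).
        scores.append(l1_distance(p, q1) - l1_distance(q1, q2))
    return sum(scores) / len(scores)
```

Because character 3-grams pick up spelling-level variation, a sketch like this would register the phonological differences (accent, dialect) the abstract mentions: a character whose speech systematically alters spellings shifts the 3-gram distribution and scores above the null baseline.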
Related papers
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- Eyettention: An Attention-based Dual-Sequence Model for Predicting Human Scanpaths during Reading [3.9766585251585282]
We develop Eyettention, the first dual-sequence model that simultaneously processes the sequence of words and the chronological sequence of fixations.
We show that Eyettention outperforms state-of-the-art models in predicting scanpaths.
arXiv Detail & Related papers (2023-04-21T07:26:49Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the true value of the hypothesis follows the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- Cross-Lingual Speaker Identification Using Distant Supervision [84.51121411280134]
We propose a speaker identification framework that addresses issues such as lack of contextual reasoning and poor cross-lingual generalization.
We show that the resulting model outperforms previous state-of-the-art methods on two English speaker identification benchmarks by up to 9% in accuracy and 5% with only distant supervision.
arXiv Detail & Related papers (2022-10-11T20:49:44Z)
- Textual Stylistic Variation: Choices, Genres and Individuals [0.8057441774248633]
This chapter argues for more informed target metrics for the statistical processing of stylistic variation in text collections.
This chapter discusses variation given by genre, and contrasts it to variation occasioned by individual choice.
arXiv Detail & Related papers (2022-05-01T16:39:49Z)
- Idiosyncratic but not Arbitrary: Learning Idiolects in Online Registers Reveals Distinctive yet Consistent Individual Styles [7.4037154707453965]
We introduce a new approach to studying idiolects through a massive cross-author comparison to identify and encode stylistic features.
A neural model achieves strong performance at authorship identification on short texts.
We quantify the relative contributions of different linguistic elements to idiolectal variation.
arXiv Detail & Related papers (2021-09-07T15:49:23Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Disambiguatory Signals are Stronger in Word-initial Positions [48.18148856974974]
We point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word.
We find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.
arXiv Detail & Related papers (2021-02-03T18:19:16Z)
- Aspectuality Across Genre: A Distributional Semantics Approach [25.816944882581343]
The interpretation of the lexical aspect of verbs in English plays a crucial role for recognizing textual entailment and learning discourse-level inferences.
We show that two elementary dimensions of aspectual class, states vs. events, and telic vs. atelic events, can be modelled effectively with distributional semantics.
arXiv Detail & Related papers (2020-10-31T19:37:22Z)
- Pick a Fight or Bite your Tongue: Investigation of Gender Differences in Idiomatic Language Usage [9.892162266128306]
We compile a novel, large and diverse corpus of spontaneous linguistic productions annotated with speakers' gender.
We perform a first large-scale empirical study of distinctions in the usage of figurative language between male and female authors.
arXiv Detail & Related papers (2020-10-31T18:44:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.