From stage to page: language independent bootstrap measures of
distinctiveness in fictional speech
- URL: http://arxiv.org/abs/2301.05659v1
- Date: Fri, 13 Jan 2023 16:58:43 GMT
- Title: From stage to page: language independent bootstrap measures of
distinctiveness in fictional speech
- Authors: Artjoms Šeļa and Ben Nagy and Joanna Byszuk and Laura
Hernández-Lorenzo and Botond Szemes and Maciej Eder
- Abstract summary: We introduce and evaluate two non-parametric methods to produce a summary statistic for character distinctiveness.
We analyse 3301 characters drawn from 2324 works, covering five centuries and four languages.
Based on exploratory analysis, we find that smaller characters tend to be more distinctive, and that women are cross-linguistically more distinctive than men.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stylometry is mostly applied to authorial style. Recently, researchers have
begun investigating the style of characters, finding that the variation remains
within authorial bounds. We address the stylistic distinctiveness of characters
in drama. Our primary contribution is methodological; we introduce and evaluate
two non-parametric methods to produce a summary statistic for character
distinctiveness that can be usefully applied and compared across languages and
times. Our first method is based on bootstrap distances between 3-gram
probability distributions, the second (reminiscent of 'unmasking' techniques)
on word keyness curves. Both methods are validated and explored by applying
them to a reasonably large corpus (a subset of DraCor): we analyse 3301
characters drawn from 2324 works, covering five centuries and four languages
(French, German, Russian, and the works of Shakespeare). Both methods appear
useful; the 3-gram method is statistically more powerful but the word keyness
method offers rich interpretability. Both methods are able to capture
phonological differences such as accent or dialect, as well as broad
differences in topic and lexical richness. Based on exploratory analysis, we
find that smaller characters tend to be more distinctive, and that women are
cross-linguistically more distinctive than men, with this latter finding
carefully interrogated using multiple regression. This greater distinctiveness
stems from a historical tendency for female characters to be restricted to an
'internal narrative domain' covering mainly direct discourse and
family/romantic themes. It is hoped that direct, comparable statistical
measures will form a basis for more sophisticated future studies, and advances
in theory.
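The first method described above can be sketched in outline. The following is a minimal illustration, not the authors' implementation: the L1 distance, window size, resampling scheme, and same-source null-baseline correction are all assumptions made for the sketch.

```python
import random
from collections import Counter

def char_trigram_dist(text):
    """Character 3-gram probability distribution of a text."""
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def l1_distance(p, q):
    """L1 distance between two probability distributions."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def sample_window(text, k, rng):
    """Draw a random contiguous window of k characters."""
    start = rng.randrange(max(1, len(text) - k))
    return text[start:start + k]

def bootstrap_distinctiveness(character_text, rest_text,
                              n_iter=100, window=500, seed=0):
    """Mean bootstrapped 3-gram distance between a character's speech
    and the rest of the play, corrected by a same-source null baseline
    so that an undistinctive character scores near zero."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        p = char_trigram_dist(sample_window(character_text, window, rng))
        q1 = char_trigram_dist(sample_window(rest_text, window, rng))
        q2 = char_trigram_dist(sample_window(rest_text, window, rng))
        # Character vs. rest, minus rest vs. rest (the null expectation).
        scores.append(l1_distance(p, q1) - l1_distance(q1, q2))
    return sum(scores) / len(scores)
```

Because character 3-grams pick up spelling-level variation, a sketch like this would register the phonological differences (accent, dialect) the abstract mentions: a character whose speech systematically alters spellings shifts the 3-gram distribution and scores above the null baseline.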
Related papers
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- Eyettention: An Attention-based Dual-Sequence Model for Predicting Human Scanpaths during Reading [3.9766585251585282]
We develop Eyettention, the first dual-sequence model that simultaneously processes the sequence of words and the chronological sequence of fixations.
We show that Eyettention outperforms state-of-the-art models in predicting scanpaths.
arXiv Detail & Related papers (2023-04-21T07:26:49Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the true value of the hypothesis follows the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- Cross-Lingual Speaker Identification Using Distant Supervision [84.51121411280134]
We propose a speaker identification framework that addresses issues such as lack of contextual reasoning and poor cross-lingual generalization.
We show that the resulting model outperforms previous state-of-the-art methods on two English speaker identification benchmarks by up to 9% in accuracy and 5% with only distant supervision.
arXiv Detail & Related papers (2022-10-11T20:49:44Z)
- Textual Stylistic Variation: Choices, Genres and Individuals [0.8057441774248633]
This chapter argues for more informed target metrics for the statistical processing of stylistic variation in text collections.
This chapter discusses variation given by genre, and contrasts it to variation occasioned by individual choice.
arXiv Detail & Related papers (2022-05-01T16:39:49Z)
- Idiosyncratic but not Arbitrary: Learning Idiolects in Online Registers Reveals Distinctive yet Consistent Individual Styles [7.4037154707453965]
We introduce a new approach to studying idiolects through a massive cross-author comparison to identify and encode stylistic features.
A neural model achieves strong performance at authorship identification on short texts.
We quantify the relative contributions of different linguistic elements to idiolectal variation.
arXiv Detail & Related papers (2021-09-07T15:49:23Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Disambiguatory Signals are Stronger in Word-initial Positions [48.18148856974974]
We point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word.
We find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.
arXiv Detail & Related papers (2021-02-03T18:19:16Z)
- Aspectuality Across Genre: A Distributional Semantics Approach [25.816944882581343]
The interpretation of the lexical aspect of verbs in English plays a crucial role for recognizing textual entailment and learning discourse-level inferences.
We show that two elementary dimensions of aspectual class, states vs. events, and telic vs. atelic events, can be modelled effectively with distributional semantics.
arXiv Detail & Related papers (2020-10-31T19:37:22Z)
- Pick a Fight or Bite your Tongue: Investigation of Gender Differences in Idiomatic Language Usage [9.892162266128306]
We compile a novel, large and diverse corpus of spontaneous linguistic productions annotated with speakers' gender.
We perform a first large-scale empirical study of distinctions in the usage of figurative language between male and female authors.
arXiv Detail & Related papers (2020-10-31T18:44:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.