Metrical Tagging in the Wild: Building and Annotating Poetry Corpora
with Rhythmic Features
- URL: http://arxiv.org/abs/2102.08858v1
- Date: Wed, 17 Feb 2021 16:38:57 GMT
- Authors: Thomas Haider
- Abstract summary: We provide large poetry corpora for English and German, and annotate prosodic features in smaller corpora to train corpus-driven neural models.
We show that BiLSTM-CRF models with syllable embeddings outperform a CRF baseline and different BERT-based approaches.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A prerequisite for the computational study of literature is the availability
of properly digitized texts, ideally with reliable meta-data and ground-truth
annotation. Poetry corpora do exist for a number of languages, but larger
collections lack consistency and are encoded in various standards, while
annotated corpora are typically constrained to a particular genre and/or were
designed for the analysis of certain linguistic features (like rhyme). In this
work, we provide large poetry corpora for English and German, and annotate
prosodic features in smaller corpora to train corpus-driven neural models that
enable robust large-scale analysis.
We show that BiLSTM-CRF models with syllable embeddings outperform a CRF
baseline and different BERT-based approaches. In a multi-task setup, particular
beneficial task relations illustrate the inter-dependence of poetic features. A
model learns foot boundaries better when jointly predicting syllable stress;
aesthetic emotions and verse measures benefit from each other; and we find that
caesuras are quite dependent on syntax and also integral to shaping the overall
measure of the line.
Related papers
- Entropy and type-token ratio in gigaword corpora [0.0]
We investigate entropy and type-token ratio, two metrics for lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish.
We find a functional relation between entropy and type-token ratio that holds across the corpora under consideration.
Our results contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
arXiv Detail & Related papers (2024-11-15T14:40:59Z) - Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets [3.0040661953201475]
Large language models (LLMs) can now generate and recognize poetry.
We develop a task to evaluate how well LLMs recognize one aspect of English-language poetry.
We show that state-of-the-art LLMs can successfully identify both common and uncommon fixed poetic forms.
arXiv Detail & Related papers (2024-06-27T05:36:53Z) - A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on the English datasets QQP and MRPC and the Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z) - ALBERTI, a Multilingual Domain Specific Language Model for Poetry
Analysis [0.0]
We present ALBERTI, the first multilingual pre-trained large language model for poetry.
We further trained multilingual BERT on a corpus of over 12 million verses from 12 languages.
ALBERTI achieves state-of-the-art results for German when compared to rule-based systems.
arXiv Detail & Related papers (2023-07-03T22:50:53Z) - PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in
Poetry Generation [58.36105306993046]
Controllable text generation is a challenging and meaningful field in natural language generation (NLG).
In this paper, we pioneer the use of the Diffusion model for generating sonnets and Chinese SongCi poetry.
Our model outperforms existing models in automatic evaluation of semantic, metrical, and overall performance as well as human evaluation.
arXiv Detail & Related papers (2023-06-14T11:57:31Z) - How much do language models copy from their training data? Evaluating
linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z) - Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship
Attribution [74.27826764855911]
We employ syllabic quantity as a base for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts.
Our experiments, carried out on three different datasets, using two different machine learning methods, show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
arXiv Detail & Related papers (2021-10-27T06:25:31Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - Quasi Error-free Text Classification and Authorship Recognition in a
large Corpus of English Literature based on a Novel Feature Set [0.0]
We show that quasi error-free text classification and authorship recognition are possible across the entire GLEC with a method using the same set of five style and five content features.
Our data pave the way for many future computational and empirical studies of literature or experiments in reading psychology.
arXiv Detail & Related papers (2020-10-21T07:39:55Z) - Self-organizing Pattern in Multilayer Network for Words and Syllables [17.69876273827734]
We propose a new universal law that highlights the equally important role of syllables.
Plotting the rank-rank frequency distributions of words and syllables for English and Chinese corpora reveals visible lines that can be fit to a master curve.
arXiv Detail & Related papers (2020-05-05T12:01:47Z)