Related papers: Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features

URL: http://arxiv.org/abs/2102.08858v1
Date: Wed, 17 Feb 2021 16:38:57 GMT
Title: Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features
Authors: Thomas Haider
Abstract summary: We provide large poetry corpora for English and German, and annotate prosodic features in smaller corpora to train corpus-driven neural models. We show that BiLSTM-CRF models with syllable embeddings outperform a CRF baseline and different BERT-based approaches.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A prerequisite for the computational study of literature is the availability of properly digitized texts, ideally with reliable metadata and ground-truth annotation. Poetry corpora do exist for a number of languages, but larger collections lack consistency and are encoded in various standards, while annotated corpora are typically constrained to a particular genre and/or were designed for the analysis of certain linguistic features (like rhyme). In this work, we provide large poetry corpora for English and German, and annotate prosodic features in smaller corpora to train corpus-driven neural models that enable robust large-scale analysis. We show that BiLSTM-CRF models with syllable embeddings outperform a CRF baseline and different BERT-based approaches. In a multi-task setup, particularly beneficial task relations illustrate the inter-dependence of poetic features: a model learns foot boundaries better when jointly predicting syllable stress; aesthetic emotions and verse measures benefit from each other; and we find that caesuras are quite dependent on syntax and also integral to shaping the overall measure of the line.

Related papers

Linguistically Informed Graph Model and Semantic Contrastive Learning for Korean Short Text Classification [2.4071330817126477]
We propose LIGRAM, a hierarchical heterogeneous graph model for Korean short-text classification. The proposed model constructs sub-graphs at the morpheme, part-of-speech, and named-entity levels and hierarchically integrates them to compensate for the limited contextual information in short texts. We evaluate LIGRAM on four Korean short-text datasets, where it consistently outperforms existing baseline models.
arXiv Detail & Related papers (2026-03-04T02:17:13Z)
METRICALARGS: A Taxonomy for Studying Metrical Poetry with LLMs [4.33144664431421]
We introduce MetricalARGS, the first taxonomy of poetry-related NLP tasks designed to evaluate Large Language Models. We discuss how these tasks relate to existing NLP tasks, addressing questions around datasets and evaluation metrics.
arXiv Detail & Related papers (2025-10-09T13:14:38Z)
From Plain Text to Poetic Form: Generating Metrically-Constrained Sanskrit Verses [22.08984009109879]
We introduce a dataset designed for translating English prose into structured Sanskrit verse. We explore constrained decoding strategies and instruction-based fine-tuning tailored to metrical and semantic fidelity.
arXiv Detail & Related papers (2025-06-01T03:35:46Z)
Entropy and type-token ratio in gigaword corpora [0.0]
We investigate entropy and type-token ratio, two metrics for lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish. We find a functional relation between entropy and type-token ratio that holds across the corpora under consideration. Our results contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
arXiv Detail & Related papers (2024-11-15T14:40:59Z)
Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets [3.0040661953201475]
Large language models (LLMs) can now generate and recognize poetry. We develop a task to evaluate how well LLMs recognize one aspect of English-language poetry. We show that state-of-the-art LLMs can successfully identify both common and uncommon fixed poetic forms.
arXiv Detail & Related papers (2024-06-27T05:36:53Z)
A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models. We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
ALBERTI, a Multilingual Domain Specific Language Model for Poetry Analysis [0.0]
We present Alberti, the first multilingual pre-trained large language model for poetry. We further trained multilingual BERT on a corpus of over 12 million verses from 12 languages. Alberti achieves state-of-the-art results for German when compared to rule-based systems.
arXiv Detail & Related papers (2023-07-03T22:50:53Z)
PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in Poetry Generation [58.36105306993046]
Controllable text generation is a challenging and meaningful field in natural language generation (NLG). In this paper, we pioneer the use of the Diffusion model for generating sonnets and Chinese SongCi poetry. Our model outperforms existing models in automatic evaluation of semantic, metrical, and overall performance as well as human evaluation.
arXiv Detail & Related papers (2023-06-14T11:57:31Z)
How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship Attribution [74.27826764855911]
We employ syllabic quantity as a base for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts. Our experiments, carried out on three different datasets, using two different machine learning methods, show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
arXiv Detail & Related papers (2021-10-27T06:25:31Z)
Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation. This paper aims to address the issue with a mask-and-predict strategy. We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions. Experiments on Semantic Textual Similarity show the resulting metric, Neighboring Distribution Divergence (NDD), to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information. Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks. This study assesses existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set [0.0]
We show that, across the entire Gutenberg Literary English Corpus (GLEC), quasi error-free text classification and authorship recognition are possible with a method using the same set of five style and five content features. Our data pave the way for many future computational and empirical studies of literature or experiments in reading psychology.
arXiv Detail & Related papers (2020-10-21T07:39:55Z)
DISCO PAL: Diachronic Spanish Sonnet Corpus with Psychological and Affective Labels [1.7205106391379026]
This article presents a study of an annotated corpus of Spanish sonnets, in order to analyse whether it is possible to build features from their individual words for predicting their general affective meaning (GAM). The corpus used contains 274 Spanish sonnets by authors from different centuries, from the 15th to the 19th. As a result, the corpus of sonnets can be used in different applications, such as poetry recommender systems, personality text mining studies of the authors, or the usage of poetry for therapeutic purposes.
arXiv Detail & Related papers (2020-07-09T08:26:22Z)
Self-organizing Pattern in Multilayer Network for Words and Syllables [17.69876273827734]
We propose a new universal law that highlights the equally important role of syllables. When the rank-rank frequency distributions of words and syllables are plotted for English and Chinese corpora, visible lines appear and can be fit to a master curve.
arXiv Detail & Related papers (2020-05-05T12:01:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.