PARSI: Persian Authorship Recognition via Stylometric Integration
- URL: http://arxiv.org/abs/2506.21840v1
- Date: Fri, 27 Jun 2025 01:08:52 GMT
- Title: PARSI: Persian Authorship Recognition via Stylometric Integration
- Authors: Kourosh Shahnazari, Mohammadali Keshtparvar, Seyed Moein Ayyoubzadeh,
- Abstract summary: We employ a multi-input neural framework to determine authorship among 67 prominent Persian poets.<n>We compiled a vast corpus of 647,653 verses of the Ganjoor digital collection, validating the data through strict preprocessing and author verification.<n>Our work focuses on the integration of deep representational forms with domain-specific features for improved authorship attribution.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The intricate linguistic, stylistic, and metrical aspects of Persian classical poetry pose a challenge for computational authorship attribution. In this work, we present a versatile framework to determine authorship among 67 prominent poets. We employ a multi-input neural framework consisting of a transformer-based language encoder complemented by features addressing the semantic, stylometric, and metrical dimensions of Persian poetry. Our feature set encompasses 100-dimensional Word2Vec embeddings, seven stylometric measures, and categorical encodings of poetic form and meter. We compiled a vast corpus of 647,653 verses of the Ganjoor digital collection, validating the data through strict preprocessing and author verification while preserving poem-level splitting to prevent overlap. This work employs verse-level classification and majority and weighted voting schemes in evaluation, revealing that weighted voting yields 71% accuracy. We further investigate threshold-based decision filtering, allowing the model to generate highly confident predictions, achieving 97% accuracy at a 0.9 threshold, though at lower coverage. Our work focuses on the integration of deep representational forms with domain-specific features for improved authorship attribution. The results illustrate the potential of our approach for automated classification and the contribution to stylistic analysis, authorship disputes, and general computational literature research. This research will facilitate further research on multilingual author attribution, style shift, and generative modeling of Persian poetry.
Related papers
- NAZM: Network Analysis of Zonal Metrics in Persian Poetic Tradition [0.0]
This study formalizes a computational model to simulate classical Persian poets' dynamics of influence.<n>We draw upon semantic, lexical, stylistic, thematic, and metrical features to demarcate each poet's corpus.<n>For typological insight, we use the Louvain community detection algorithm to demarcate clusters of poets sharing both style and theme coherence.
arXiv Detail & Related papers (2025-05-12T20:39:53Z) - Author-Specific Linguistic Patterns Unveiled: A Deep Learning Study on Word Class Distributions [0.0]
This study investigates author-specific word class distributions using part-of-speech (POS) tagging and bigram analysis.<n>By leveraging deep neural networks, we classify literary authors based on POS tag vectors and bigram frequency matrices derived from their works.
arXiv Detail & Related papers (2025-01-17T09:43:49Z) - AraPoemBERT: A Pretrained Language Model for Arabic Poetry Analysis [0.0]
We introduce AraPoemBERT, an Arabic language model pretrained exclusively on Arabic poetry text.
AraPoemBERT achieved unprecedented accuracy in two out of three novel tasks: poet's gender classification and poetry sub-meter classification.
The dataset used in this study contains more than 2.09 million verses collected from online sources.
arXiv Detail & Related papers (2024-03-19T02:59:58Z) - Understanding writing style in social media with a supervised
contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 106 authored texts.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z) - Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z) - PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in
Poetry Generation [58.36105306993046]
Controllable text generation is a challenging and meaningful field in natural language generation (NLG)
In this paper, we pioneer the use of the Diffusion model for generating sonnets and Chinese SongCi poetry.
Our model outperforms existing models in automatic evaluation of semantic, metrical, and overall performance as well as human evaluation.
arXiv Detail & Related papers (2023-06-14T11:57:31Z) - Huruf: An Application for Arabic Handwritten Character Recognition Using
Deep Learning [0.0]
We propose a lightweight Convolutional Neural Network-based architecture for recognizing Arabic characters and digits.
The proposed pipeline consists of a total of 18 layers containing four layers each for convolution, pooling, batch normalization, dropout, and finally one Global average layer.
The proposed model respectively achieved an accuracy of 96.93% and 99.35% which is comparable to the state-of-the-art and makes it a suitable solution for real-life end-level applications.
arXiv Detail & Related papers (2022-12-16T17:39:32Z) - PART: Pre-trained Authorship Representation Transformer [52.623051272843426]
Authors writing documents imprint identifying information within their texts.<n>Previous works use hand-crafted features or classification tasks to train their authorship models.<n>We propose a contrastively trained model fit to learn textbfauthorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - On the Usefulness of Embeddings, Clusters and Strings for Text Generator
Evaluation [86.19634542434711]
Mauve measures an information-theoretic divergence between two probability distributions over strings.
We show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance.
We conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- such cluster-based substitutes to string distributions may simply be better for evaluating state-of-the-art language generators.
arXiv Detail & Related papers (2022-05-31T17:58:49Z) - Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship
Attribution [74.27826764855911]
We employ syllabic quantity as a base for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts.
Our experiments, carried out on three different datasets, using two different machine learning methods, show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
arXiv Detail & Related papers (2021-10-27T06:25:31Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - The Challenges of Persian User-generated Textual Content: A Machine
Learning-Based Approach [0.0]
This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content.
The presented approach uses a machine-translated datasets to conduct sentiment analysis for the Persian language.
The results of the experiments have shown promising state-of-the-art performance in contrast to the previous efforts.
arXiv Detail & Related papers (2021-01-20T11:57:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.