Stylometry Analysis of Multi-authored Documents for Authorship and
Author Style Change Detection
- URL: http://arxiv.org/abs/2401.06752v1
- Date: Fri, 12 Jan 2024 18:36:41 GMT
- Title: Stylometry Analysis of Multi-authored Documents for Authorship and
Author Style Change Detection
- Authors: Muhammad Tayyab Zamir, Muhammad Asif Ayub, Asma Gul, Nasir Ahmad,
Kashif Ahmad
- Abstract summary: This paper investigates three key tasks of style analysis: (i) classification of single and multi-authored documents, (ii) single change detection, and (iii) multiple author-switching detection in multi-authored documents.
We propose a merit-based fusion framework that integrates several state-of-the-art natural language processing (NLP) algorithms and weight optimization techniques.
- Score: 2.117778717665161
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the increasing use of Artificial Intelligence based text
generation tools has posed new challenges in document provenance,
authentication, and authorship detection. However, advancements in stylometry
have provided opportunities for automatic authorship and author change
detection in multi-authored documents using style analysis techniques. Style
analysis can serve as a primary step toward document provenance and
authentication through authorship detection. This paper investigates three key
tasks of style analysis: (i) classification of single and multi-authored
documents, (ii) single change detection, which involves identifying the point
where the author switches, and (iii) multiple author-switching detection in
multi-authored documents. We formulate all three tasks as classification
problems and propose a merit-based fusion framework that integrates several
state-of-the-art natural language processing (NLP) algorithms and weight
optimization techniques. We also explore the potential of special characters,
which are typically removed during pre-processing in NLP applications, on the
performance of the proposed methods for these tasks by conducting extensive
experiments on both cleaned and raw datasets. Experimental results demonstrate
significant improvements over existing solutions for all three tasks on a
benchmark dataset.
Related papers
- Detecting Machine-Generated Long-Form Content with Latent-Space Variables [54.07946647012579]
Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts.
We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts.
arXiv Detail & Related papers (2024-10-04T18:42:09Z) - A Novel Dataset for Non-Destructive Inspection of Handwritten Documents [0.0]
Forensic handwriting examination aims to examine handwritten documents in order to properly define or hypothesize the manuscript's author.
We propose a new and challenging dataset consisting of two subsets: the first consists of 21 documents written either by the classic pen and paper" approach (and later digitized) and directly acquired on common devices such as tablets.
Preliminary results on the proposed datasets show that 90% classification accuracy can be achieved on the first subset.
arXiv Detail & Related papers (2024-01-09T09:25:58Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - FacTool: Factuality Detection in Generative AI -- A Tool Augmented
Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z) - Application of Transformers based methods in Electronic Medical Records:
A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z) - Document Provenance and Authentication through Authorship Classification [5.2545206693029884]
We propose an ensemble-based text-processing framework for the classification of single and multi-authored documents.
The proposed framework incorporates several state-of-the-art text classification algorithms.
The framework is evaluated on a large-scale benchmark dataset.
arXiv Detail & Related papers (2023-03-02T12:26:03Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - Continuous Offline Handwriting Recognition using Deep Learning Models [0.0]
Handwritten text recognition is an open problem of great interest in the area of automatic document image analysis.
We have proposed a new recognition model based on integrating two types of deep learning architectures: convolutional neural networks (CNN) and sequence-to-sequence (seq2seq)
The new proposed model provides competitive results with those obtained with other well-established methodologies.
arXiv Detail & Related papers (2021-12-26T07:31:03Z) - Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or
Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z) - Evaluation of a Region Proposal Architecture for Multi-task Document
Layout Analysis [0.685316573653194]
Mask-RCNN architecture is designed to address the problem of baseline detection and region segmentation.
We present experimental results on two handwritten text datasets and one handwritten music dataset.
The analyzed architecture yields promising results, outperforming state-of-the-art techniques in all three datasets.
arXiv Detail & Related papers (2021-06-22T14:07:27Z) - Extracting Procedural Knowledge from Technical Documents [1.0773368566852943]
Procedures are an important knowledge component of documents that can be leveraged by cognitive assistants for automation, question-answering or driving a conversation.
It is a challenging problem to parse big dense documents like product manuals, user guides to automatically understand which parts are talking about procedures and subsequently extract them.
arXiv Detail & Related papers (2020-10-20T09:47:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.