A Novel Dataset for Non-Destructive Inspection of Handwritten Documents
- URL: http://arxiv.org/abs/2401.04448v1
- Date: Tue, 9 Jan 2024 09:25:58 GMT
- Title: A Novel Dataset for Non-Destructive Inspection of Handwritten Documents
- Authors: Eleonora Breci (1), Luca Guarnera (1), Sebastiano Battiato (1) ((1)
University of Catania)
- Abstract summary: Forensic handwriting examination aims to examine handwritten documents in order to properly define or hypothesize the manuscript's author.
We propose a new and challenging dataset consisting of two subsets: the first consists of 21 documents written both via the classic "pen and paper" approach (and later digitized) and directly acquired on common devices such as tablets.
Preliminary results on the proposed datasets show that 90% classification accuracy can be achieved on the first subset.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Forensic handwriting examination is a branch of Forensic Science that aims to
examine handwritten documents in order to properly define or hypothesize the
manuscript's author. This analysis involves comparing two or more (digitized)
documents through a comprehensive comparison of intrinsic local and global
features. If a correlation exists and specific best practices are satisfied,
then it will be possible to affirm that the documents under analysis were
written by the same individual. The need to create sophisticated tools capable
of extracting and comparing significant features has led to the development of
cutting-edge software with almost entirely automated processes, improving the
forensic examination of handwriting and achieving increasingly objective
evaluations. This is made possible by algorithmic solutions based on purely
mathematical concepts. Machine Learning and Deep Learning models trained with
specific datasets could turn out to be the key elements to best solve the task
at hand. In this paper, we propose a new and challenging dataset consisting of
two subsets: the first consists of 21 documents written both via the classic
"pen and paper" approach (and later digitized) and directly acquired on common
devices such as tablets; the second consists of 362 handwritten manuscripts by
124 different people, acquired following a specific pipeline. Our study
pioneered a comparison between traditionally handwritten documents and those
produced with digital tools (e.g., tablets). Preliminary results on the
proposed datasets show that 90% classification accuracy can be achieved on the
first subset (documents written with pen and paper and later digitized, and
documents acquired directly on tablets) and 96% on the second portion of the
data. The datasets are
available at
https://iplab.dmi.unict.it/mfs/forensic-handwriting-analysis/novel-dataset-2023/.
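The abstract describes authorship analysis as a comparison of intrinsic local and global features extracted from digitized documents, and one of the related works below discerns authorship via the Euclidean distance between document feature vectors. The following is a minimal, hypothetical sketch of such a feature-vector comparison; the feature names, values, and threshold are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch only: comparing two document feature vectors and
# hypothesizing common authorship when their Euclidean distance is small.
# Features and threshold are placeholders, not the paper's method.
import numpy as np

def euclidean_distance(features_a: np.ndarray, features_b: np.ndarray) -> float:
    """Distance between two document feature vectors; smaller means more similar."""
    return float(np.linalg.norm(features_a - features_b))

def same_writer(features_a: np.ndarray, features_b: np.ndarray,
                threshold: float = 0.1) -> bool:
    """Hypothesize common authorship when the distance falls below a tuned threshold."""
    return euclidean_distance(features_a, features_b) < threshold

if __name__ == "__main__":
    # Hypothetical global features (e.g., normalized slant, stroke width, spacing).
    doc_paper = np.array([0.42, 0.73, 0.18])    # digitized pen-and-paper document
    doc_tablet = np.array([0.45, 0.70, 0.21])   # document acquired directly on a tablet
    print(same_writer(doc_paper, doc_tablet))   # True under this toy threshold
```

In practice, the threshold would be calibrated on reference documents of known authorship, and the feature vectors would be extracted automatically from the scanned or tablet-acquired images rather than entered by hand.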
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - Innovative Methods for Non-Destructive Inspection of Handwritten
Documents [0.0]
We present a framework capable of extracting and analyzing intrinsic measures of manuscript documents using image processing and deep learning techniques.
Authorship can be discerned by quantifying the Euclidean distance between the feature vectors of the documents being compared.
Experimental results demonstrate the ability of our method to objectively determine authorship in different writing media, outperforming the state of the art.
arXiv Detail & Related papers (2023-10-17T12:45:04Z) - Document Provenance and Authentication through Authorship Classification [5.2545206693029884]
We propose an ensemble-based text-processing framework for the classification of single and multi-authored documents.
The proposed framework incorporates several state-of-the-art text classification algorithms.
The framework is evaluated on a large-scale benchmark dataset.
arXiv Detail & Related papers (2023-03-02T12:26:03Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z) - Handwriting Classification for the Analysis of Art-Historical Documents [6.918282834668529]
We focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI.
We propose a handwriting classification model that labels extracted text fragments based on their visual structure.
arXiv Detail & Related papers (2020-11-04T13:06:46Z) - Learning from similarity and information extraction from structured
documents [0.0]
The aim is to improve micro F1 of per-word classification on a huge real-world document dataset.
Results confirm that all proposed architecture parts are required to beat the previous results.
The best model improves the previous state-of-the-art results by an 8.25 gain in F1 score.
arXiv Detail & Related papers (2020-10-17T21:34:52Z) - Neural Deepfake Detection with Factual Structure of Text [78.30080218908849]
We propose a graph-based model for deepfake detection of text.
Our approach represents the factual structure of a given document as an entity graph.
Our model can distinguish the difference in the factual structure between machine-generated text and human-written text.
arXiv Detail & Related papers (2020-10-15T02:35:31Z) - Scaling Systematic Literature Reviews with Machine Learning Pipelines [57.82662094602138]
Systematic reviews entail the extraction of data from scientific documents.
We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs.
We find that we can get surprising accuracy and generalisability of the whole pipeline system with only 2 weeks of human-expert annotation.
arXiv Detail & Related papers (2020-10-09T16:19:42Z) - Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised
Deep Asymmetric Metric Learning [62.34197797857823]
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds.
arXiv Detail & Related papers (2020-03-23T03:22:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.