Introducing a new high-resolution handwritten digits data set with
writer characteristics
- URL: http://arxiv.org/abs/2011.07946v3
- Date: Wed, 13 Apr 2022 21:46:00 GMT
- Title: Introducing a new high-resolution handwritten digits data set with
writer characteristics
- Authors: Cédric Beaulac, Jeffrey S. Rosenthal
- Abstract summary: We introduce a new handwritten digit data set that we collected.
It contains high-resolution images of handwritten digits together with various writer characteristics.
Multiple writer characteristics gathered are a novelty of our data set and create new research opportunities.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The contributions in this article are two-fold. First,
we introduce a new handwritten digit data set that we collected. It contains
high-resolution images of handwritten digits together with various writer
characteristics which are not available in the well-known MNIST database. The
multiple writer characteristics gathered are a novelty of our data set and
create new research opportunities. The data set is publicly available online.
Second, we analyse this new data set. We begin with simple supervised tasks. We
assess the predictability of the writer characteristics gathered, the effect of
using some of those characteristics as predictors in classification tasks and
the effect of higher resolution images on classification accuracy. We also
explore semi-supervised applications; we can leverage the high quantity of
handwritten digits data sets already existing online to improve the accuracy of
various classification tasks with noticeable success. Finally, we also
demonstrate the generative perspective offered by this new data set; we are
able to generate images that mimic the writing style of specific writers. The
data set has unique and distinct features and our analysis establishes
benchmarks and showcases some of the new opportunities made possible with this
new data set.
Related papers
- BookWorm: A Dataset for Character Description and Analysis [59.186325346763184]
We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation.
We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses.
Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
arXiv Detail & Related papers (2024-10-14T10:55:58Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- A Novel Dataset for Non-Destructive Inspection of Handwritten Documents [0.0]
Forensic handwriting examination aims to examine handwritten documents in order to properly define or hypothesize the manuscript's author.
We propose a new and challenging dataset consisting of two subsets: the first consists of 21 documents written either by the classic "pen and paper" approach (and later digitized) or directly acquired on common devices such as tablets.
Preliminary results on the proposed datasets show that 90% classification accuracy can be achieved on the first subset.
arXiv Detail & Related papers (2024-01-09T09:25:58Z)
- Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals that is an order of magnitude larger than existing relevant datasets and where rich semantic ground-truth annotations are readily available.
We derive a simple yet efficient, personalized text-to-image baseline that does not require test-time fine-tuning for each subject and which sets quantitatively and in human trials a new SoTA.
arXiv Detail & Related papers (2023-12-11T04:47:39Z)
- NumHG: A Dataset for Number-Focused Headline Generation [28.57003500212883]
Headline generation, a key task in abstractive summarization, strives to condense a full-length article into a succinct, single line of text.
We introduce a new dataset, the NumHG, and provide over 27,000 annotated numeral-rich news articles for detailed investigation.
We evaluate five well-performing models from previous headline generation tasks using human evaluation in terms of numerical accuracy, reasonableness, and readability.
arXiv Detail & Related papers (2023-09-04T09:03:53Z)
- Challenging the Myth of Graph Collaborative Filtering: a Reasoned and Reproducibility-driven Analysis [50.972595036856035]
We present code that successfully replicates results from six popular and recent graph recommendation models.
We compare these graph models with traditional collaborative filtering models that historically performed well in offline evaluations.
By investigating the information flow from users' neighborhoods, we aim to identify which models are influenced by intrinsic features in the dataset structure.
arXiv Detail & Related papers (2023-08-01T09:31:44Z)
- Sampling and Ranking for Digital Ink Generation on a tight computational budget [69.15275423815461]
We study ways to maximize the quality of the output of a trained digital ink generative model.
We use and compare the effect of multiple sampling and ranking techniques, in the first ablation study of its kind in the digital ink domain.
arXiv Detail & Related papers (2023-06-02T09:55:15Z)
- How to Choose Pretrained Handwriting Recognition Models for Single Writer Fine-Tuning [23.274139396706264]
Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on modern and historical manuscripts.
Those models struggle to obtain the same performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting.
In this paper, we take into account large, real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model.
We give a quantitative indication of the most relevant characteristics of such data for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as few as five real fine-tuning lines.
arXiv Detail & Related papers (2023-05-04T07:00:28Z)
- The Learnable Typewriter: A Generative Approach to Text Analysis [17.355857281085164]
We present a generative document-specific approach to character analysis and recognition in text lines.
Taking as input a set of text lines with similar font or handwriting, our approach can learn a large number of different characters.
arXiv Detail & Related papers (2023-02-03T11:17:59Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Improving Accuracy and Explainability of Online Handwriting Recognition [0.9176056742068814]
We develop handwriting recognition models on the OnHW-chars dataset and improve the accuracy of previous models.
Our results are verifiable and reproducible via the provided public repository.
arXiv Detail & Related papers (2022-09-14T21:28:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.