PHD: Pixel-Based Language Modeling of Historical Documents
- URL: http://arxiv.org/abs/2310.18343v2
- Date: Sat, 4 Nov 2023 11:30:01 GMT
- Title: PHD: Pixel-Based Language Modeling of Historical Documents
- Authors: Nadav Borenstein, Phillip Rust, Desmond Elliott, Isabelle Augenstein
- Abstract summary: We propose a novel method for generating synthetic scans to resemble real historical documents.
We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period.
We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
- Score: 55.75201940642297
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The digitisation of historical documents has provided historians with
unprecedented research opportunities. Yet, the conventional approach to
analysing historical documents involves converting them from images to text
using OCR, a process that overlooks the potential benefits of treating them as
images and introduces high levels of noise. To bridge this gap, we take
advantage of recent advancements in pixel-based language models trained to
reconstruct masked patches of pixels instead of predicting token distributions.
Due to the scarcity of real historical scans, we propose a novel method for
generating synthetic scans to resemble real historical documents. We then
pre-train our model, PHD, on a combination of synthetic scans and real
historical newspapers from the 1700-1900 period. Through our experiments, we
demonstrate that PHD exhibits high proficiency in reconstructing masked image
patches and provide evidence of our model's noteworthy language understanding
capabilities. Notably, we successfully apply our model to a historical QA task,
highlighting its usefulness in this domain.
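
The masked-patch objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the PHD implementation: the helper names (`patchify`, `sample_mask`, `masked_mse`), the 4x4 patch size, the 25% mask ratio, and the constant mid-gray "model" are all assumptions chosen for illustration. It only shows the general idea of hiding some image patches and scoring a reconstruction on the hidden pixels alone, rather than predicting a distribution over tokens.

```python
import random


def patchify(image, patch):
    """Split a 2-D grayscale image (list of rows) into flattened,
    non-overlapping square patches, scanned left to right, top to bottom."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def sample_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices to hide (at least one). Hypothetical helper."""
    num_masked = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_mse(predicted, target, masked_indices):
    """Mean squared error computed over the pixels of masked patches only."""
    total, count = 0.0, 0
    for i in masked_indices:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# Toy 8x8 "scan" with random pixel intensities in [0, 1).
image = [[rng.random() for _ in range(8)] for _ in range(8)]
target = patchify(image, patch=4)  # 4 patches of 16 pixels each
masked = sample_mask(len(target), mask_ratio=0.25, rng=rng)

# Stand-in for a model: predicts mid-gray everywhere (assumption for the demo).
prediction = [[0.5] * len(p) for p in target]
loss = masked_mse(prediction, target, masked)
```

In a real setup the constant prediction would be replaced by the output of a trained encoder-decoder, and the loss would be minimised over many scans; the sketch only fixes the shape of the objective.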
Related papers
- Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and
Multi-Source Supervision [5.517240672957627]
We propose a novel knowledge-aware artifact image synthesis approach that brings lost historical objects accurately into their visual forms.
Compared to existing approaches, our proposed model produces higher-quality artifact images that align better with the implicit details and historical knowledge contained within written documents.
arXiv Detail & Related papers (2023-12-13T11:03:07Z)
- Blind Dates: Examining the Expression of Temporality in Historical Photographs [57.07335632641355]
We investigate the dating of images using OpenCLIP, an open-source implementation of CLIP, a multi-modal language and vision model.
We use the De Boer Scene Detection dataset, containing 39,866 gray-scale historical press photographs from 1950 to 1999.
Our analysis reveals that images featuring buses, cars, cats, dogs, and people are more accurately dated, suggesting the presence of temporal markers.
arXiv Detail & Related papers (2023-10-10T13:51:24Z)
- Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models [0.9065034043031668]
We present a pipeline for image extraction from historical documents using foundation models.
We evaluate text-image prompts and their effectiveness on humanities datasets of varying levels of complexity.
arXiv Detail & Related papers (2023-09-04T15:37:03Z)
- Continual Face Forgery Detection via Historical Distribution Preserving [88.66313037412846]
We focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD).
CFFD aims to efficiently learn from new forgery attacks without forgetting previous ones.
Our experiments on the benchmarks show that our method outperforms the state-of-the-art competitors.
arXiv Detail & Related papers (2023-08-11T16:37:31Z)
- The Effects of Character-Level Data Augmentation on Style-Based Dating of Historical Manuscripts [5.285396202883411]
This article explores the influence of data augmentation on the dating of historical manuscripts.
Linear Support Vector Machines were trained with k-fold cross-validation on textural and grapheme-based features extracted from historical manuscripts.
Results show that training models with augmented data improves the performance of historical manuscript dating by 1% to 3% in cumulative scores.
arXiv Detail & Related papers (2022-12-15T15:55:44Z)
- Pattern Spotting and Image Retrieval in Historical Documents using Deep Hashing [60.67014034968582]
This paper presents a deep learning approach for image retrieval and pattern spotting in digital collections of historical documents.
Deep learning models are used for feature extraction, considering two distinct variants, which provide either real-valued or binary code representations.
The proposed approach also reduces the search time by up to 200x and the storage cost up to 6,000x when compared to related works.
arXiv Detail & Related papers (2022-08-04T01:39:37Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Image-based material analysis of ancient historical documents [5.285396202883411]
This study uses images of a famous historical collection, the Dead Sea Scrolls, to propose a novel method to classify the materials of the manuscripts.
A binary classification system employing the Fourier transform with a majority voting process is shown to be effective for this classification task.
This pilot study shows a classification success rate of up to 97% for a limited number of manuscripts made of either parchment or papyrus.
arXiv Detail & Related papers (2022-03-02T11:39:22Z)
- Digital Editions as Distant Supervision for Layout Analysis of Printed Books [76.29918490722902]
We describe methods for exploiting the semantic markup in digital editions as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z)
- Lacuna Reconstruction: Self-supervised Pre-training for Low-Resource Historical Document Transcription [25.76860672652937]
We show a meaningful improvement in recognition accuracy over the same supervised model trained from scratch with as few as 30 line image transcriptions for training.
Our masked language model-style pre-training strategy, where the model is trained to be able to identify the true masked visual representation from distractors sampled from within the same line, encourages learning robust contextualized language representations.
arXiv Detail & Related papers (2021-12-16T08:28:26Z)