Lacuna Reconstruction: Self-supervised Pre-training for Low-Resource
Historical Document Transcription
- URL: http://arxiv.org/abs/2112.08692v1
- Date: Thu, 16 Dec 2021 08:28:26 GMT
- Title: Lacuna Reconstruction: Self-supervised Pre-training for Low-Resource
Historical Document Transcription
- Authors: Nikolai Vogler, Jonathan Parkes Allen, Matthew Thomas Miller, Taylor
Berg-Kirkpatrick
- Abstract summary: We show a meaningful improvement in recognition accuracy over the same supervised model trained from scratch with as few as 30 line image transcriptions for training.
Our masked language model-style pre-training strategy, where the model is trained to be able to identify the true masked visual representation from distractors sampled from within the same line, encourages learning robust contextualized language representations.
- Score: 25.76860672652937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a self-supervised pre-training approach for learning rich visual
language representations for both handwritten and printed historical document
transcription. After supervised fine-tuning of our pre-trained encoder
representations for low-resource document transcription on two languages, (1) a
heterogeneous set of handwritten Islamicate manuscript images and (2) early
modern English printed documents, we show a meaningful improvement in
recognition accuracy over the same supervised model trained from scratch with
as few as 30 line image transcriptions for training. Our masked language
model-style pre-training strategy, where the model is trained to be able to
identify the true masked visual representation from distractors sampled from
within the same line, encourages learning robust contextualized language
representations invariant to scribal writing style and printing noise present
across documents.
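The pre-training objective described in the abstract is a contrastive masked-prediction loss over visual representations of a text line: at each masked position, the encoder must pick out the true local representation from distractors drawn from elsewhere in the same line. Below is a minimal sketch of such an objective, not the authors' code; the tensor names (context_repr, target_repr), the distractor count, and the temperature are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a contrastive masked-prediction
# objective: at each masked position, the encoder output must identify the
# true local representation among distractors sampled from the same line.
import torch
import torch.nn.functional as F

def masked_contrastive_loss(context_repr, target_repr, mask,
                            num_distractors=10, temperature=0.1):
    """context_repr: (B, T, D) encoder outputs for a line image whose masked
    positions were replaced by a mask embedding before encoding.
    target_repr: (B, T, D) local "true" representations of the unmasked line.
    mask: (B, T) boolean tensor, True at masked positions.
    All shapes and hyperparameters are illustrative assumptions."""
    B, T, D = context_repr.shape
    losses = []
    for b in range(B):
        for t in mask[b].nonzero(as_tuple=True)[0].tolist():
            # Distractors: other positions sampled from the SAME line.
            cand = torch.randperm(T, device=context_repr.device)
            cand = cand[cand != t][:num_distractors]
            candidates = torch.cat([target_repr[b, t].unsqueeze(0),
                                    target_repr[b, cand]])           # (K+1, D)
            sims = F.cosine_similarity(context_repr[b, t].unsqueeze(0),
                                       candidates) / temperature     # (K+1,)
            # The true target sits at index 0 -> InfoNCE via cross-entropy.
            losses.append(F.cross_entropy(
                sims.unsqueeze(0),
                torch.zeros(1, dtype=torch.long, device=sims.device)))
    return torch.stack(losses).mean()
```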
Related papers
- Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training [68.41837295318152]
Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with visual texts.
Existing backbone models have limitations such as misspellings, failure to generate text at all, and a lack of support for Chinese text.
We propose a series of methods, aiming to empower backbone models to generate visual texts in English and Chinese.
arXiv Detail & Related papers (2024-10-06T10:25:39Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Self-Supervised Representation Learning for Online Handwriting Text Classification [0.8594140167290099]
We propose the novel Part of Stroke Masking (POSM) as a pretext task for pretraining models to extract informative representations from the online handwriting of individuals in English and Chinese.
To evaluate the quality of the extracted representations, we use both intrinsic and extrinsic evaluation methods.
The pretrained models are fine-tuned to achieve state-of-the-art results in tasks such as writer identification, gender classification, and handedness classification.
arXiv Detail & Related papers (2023-10-10T14:07:49Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Weakly Supervised Scene Text Generation for Low-resource Languages [19.243705770491577]
A large number of annotated training images is crucial for training successful scene text recognition models.
Existing scene text generation methods typically rely on a large amount of paired data, which is difficult to obtain for low-resource languages.
We propose a novel weakly supervised scene text generation method that leverages a few recognition-level labels as weak supervision.
arXiv Detail & Related papers (2023-06-25T15:26:06Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train on the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
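The InfoXLM entry above describes a contrastive pre-training task over monolingual and parallel corpora. Below is a minimal sketch of a cross-lingual contrastive (InfoNCE-style) objective over parallel sentence pairs; the function name, shapes, and the use of in-batch negatives are illustrative assumptions and may differ from the paper's exact construction.

```python
# Minimal sketch (not the InfoXLM implementation) of a cross-lingual
# contrastive objective: representations of a translation pair are pulled
# together while other sentences in the batch serve as negatives.
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(src_repr, tgt_repr, temperature=0.05):
    """src_repr, tgt_repr: (B, D) sentence representations where row i of
    each tensor comes from the two sides of one translation pair.
    Shapes, temperature, and in-batch negatives are illustrative assumptions."""
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.t() / temperature                   # (B, B) pairwise similarities
    labels = torch.arange(src.size(0), device=src.device)  # positives on the diagonal
    # Symmetric InfoNCE: source->target and target->source directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```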
This list is automatically generated from the titles and abstracts of the papers on this site.