Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets
- URL: http://arxiv.org/abs/2503.22513v1
- Date: Fri, 28 Mar 2025 15:16:48 GMT
- Title: Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets
- Authors: Martin Kišš, Michal Hradiš
- Abstract summary: Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance. We propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance in various domains. In this paper, we explore masked self-supervised pre-training for text recognition transformers. Specifically, we propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning. Furthermore, we compare our pre-trained models against those trained with transfer learning, demonstrating the effectiveness of the self-supervised pre-training. In particular, pre-training consistently improves the character error rate of models, in some cases by up to 30 % relative. It also performs on par with transfer learning while not relying on extra annotated text lines.
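The two pre-training modifications described in the abstract can be expressed compactly in code. Below is a minimal, hypothetical PyTorch-style sketch, not the authors' released implementation: the linear schedule shape, the start/end masking probabilities, the weighting of non-masked patches, and the `model(patches, mask)` interface are all assumptions used to illustrate a progressively increasing masking probability and a reconstruction loss over both masked and non-masked patches.

```python
import torch
import torch.nn.functional as F


def masking_probability(step, total_steps, p_start=0.2, p_end=0.8):
    # Progressively increase the patch-masking probability during pre-training.
    # A linear ramp and these endpoint values are assumptions; the paper only
    # states that the masking probability is increased progressively.
    t = min(step / max(total_steps, 1), 1.0)
    return p_start + t * (p_end - p_start)


def reconstruction_loss(pred, target, mask, unmasked_weight=0.1):
    # Per-patch MSE over both masked and non-masked patches.
    # pred, target: (batch, num_patches, patch_dim); mask: (batch, num_patches),
    # 1.0 for masked patches. The unmasked_weight value is a hypothetical choice;
    # the paper only says the loss incorporates both kinds of patches.
    per_patch = F.mse_loss(pred, target, reduction="none").mean(dim=-1)
    masked = (per_patch * mask).sum() / mask.sum().clamp(min=1.0)
    unmasked = (per_patch * (1.0 - mask)).sum() / (1.0 - mask).sum().clamp(min=1.0)
    return masked + unmasked_weight * unmasked


def pretrain_step(model, patches, step, total_steps, optimizer):
    # One pre-training step for a ViT-style text-line encoder `model` that
    # reconstructs patch features; model(patches, mask) is an assumed interface.
    p_mask = masking_probability(step, total_steps)
    mask = (torch.rand(patches.shape[:2], device=patches.device) < p_mask).float()
    pred = model(patches, mask)
    loss = reconstruction_loss(pred, patches, mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, raising the masking probability over time keeps the reconstruction task easy early on and harder later, while the small weight on non-masked patches regularizes the representation of visible patches without dominating the loss.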
Related papers
- AdaViT: Adaptive Vision Transformer for Flexible Pretrain and Finetune with Variable 3D Medical Image Modalities [9.006543373916314]
We propose an adaptive Vision Transformer (AdaViT) framework capable of handling a variable set of input modalities for each case.
We demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets.
For self-supervised pre-training, the proposed method makes maximal use of the pre-training data and facilitates transfer to diverse downstream tasks with variable sets of input modalities.
arXiv Detail & Related papers (2025-04-04T16:57:06Z) - Efficient Transferability Assessment for Selection of Pre-trained Detectors [63.21514888618542]
This paper studies the efficient transferability assessment of pre-trained object detectors.
We build up a detector transferability benchmark which contains a large and diverse zoo of pre-trained detectors.
Experimental results demonstrate that our method outperforms other state-of-the-art approaches in assessing transferability.
arXiv Detail & Related papers (2024-03-14T14:23:23Z) - Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z) - On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning [0.0]
We investigate how pre-training on data of different modalities, such as language and vision, affects fine-tuning of Transformer-based models to Mujoco offline reinforcement learning tasks.
arXiv Detail & Related papers (2022-11-17T13:34:08Z) - Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z) - Robust Transfer Learning with Pretrained Language Models through Adapters [40.45102278979193]
Transfer learning with large pretrained language models like BERT has become a dominating approach for most NLP tasks.
We propose a simple yet effective adapter-based approach to mitigate these issues.
Our experiments demonstrate that such a training scheme leads to improved stability and adversarial robustness in transfer learning to various downstream tasks.
arXiv Detail & Related papers (2021-08-05T02:30:13Z) - How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers? [7.727662147015879]
Task-agnostic forms of data augmentation have proven widely effective in computer vision, even on pretrained models.
We ask how effective these techniques really are when applied to pretrained transformers.
We observe a negative result, finding that techniques which previously reported strong improvements for non-pretrained models fail to consistently improve performance for pretrained transformers.
arXiv Detail & Related papers (2020-10-05T03:55:15Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning, which exploits reliable samples from a large amount of unlabeled data to enlarge the training set.
We experimentally verify that the resulting dataset can significantly improve the ability of the learned FER model.
To keep training on this enlarged dataset practical, we propose to apply a dataset distillation strategy that compresses it into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z) - Improving Semantic Segmentation via Self-Training [75.07114899941095]
We show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm.
We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performances on Cityscapes, CamVid and KITTI datasets.
arXiv Detail & Related papers (2020-04-30T17:09:17Z)
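For illustration, the self-training recipe summarized in the last entry (train a teacher on labeled data, generate pseudo labels on unlabeled data, then train on both) could be sketched roughly as below. This is a generic pseudo-labeling sketch, not that paper's robust training framework; the confidence threshold, the ignore index of 255, and the loader/model names are placeholders.

```python
import torch


@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, threshold=0.9):
    # Keep only confident teacher predictions as pseudo labels; low-confidence
    # pixels receive the ignore index. The threshold value is an assumption.
    teacher.eval()
    pseudo = []
    for images in unlabeled_loader:
        probs = torch.softmax(teacher(images), dim=1)   # (B, C, H, W) class scores
        conf, labels = probs.max(dim=1)
        labels[conf < threshold] = 255                  # 255 = ignore index
        pseudo.append((images, labels))
    return pseudo


def train_student(student, labeled_loader, pseudo_data, optimizer, epochs=1):
    # Train the student jointly on human-annotated and pseudo-labeled batches.
    criterion = torch.nn.CrossEntropyLoss(ignore_index=255)
    student.train()
    for _ in range(epochs):
        for images, labels in list(labeled_loader) + pseudo_data:
            loss = criterion(student(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```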