Is it an i or an l: Test-time Adaptation of Text Line Recognition Models
- URL: http://arxiv.org/abs/2308.15037v1
- Date: Tue, 29 Aug 2023 05:44:00 GMT
- Title: Is it an i or an l: Test-time Adaptation of Text Line Recognition Models
- Authors: Debapriya Tula, Sujoy Paul, Gagan Madan, Peter Garst, Reeve Ingle,
Gaurav Aggarwal
- Abstract summary: We introduce the problem of adapting text line recognition models during test time.
We propose an iterative self-training approach that uses feedback from the language model to update the optical model.
Experimental results show that the proposed adaptation method offers an absolute improvement of up to 8% in character error rate.
- Score: 9.149602257966917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing text lines from images is a challenging problem, especially for
handwritten documents due to large variations in writing styles. While text
line recognition models are generally trained on large corpora of real and
synthetic data, such models can still make frequent mistakes if the handwriting
is inscrutable or the image acquisition process adds corruptions, such as
noise, blur, compression, etc. Writing style is generally quite consistent for
an individual, which can be leveraged to correct mistakes made by such models.
Motivated by this, we introduce the problem of adapting text line recognition
models during test time. We focus on a challenging and realistic setting where,
given only a single test image consisting of multiple text lines, the task is
to adapt the model such that it performs better on the image, without any
labels. We propose an iterative self-training approach that uses feedback from
the language model to update the optical model, with confident self-labels in
each iteration. The confidence measure is based on an augmentation mechanism
that evaluates the divergence of the prediction of the model in a local region.
We perform rigorous evaluation of our method on several benchmark datasets as
well as their corrupted versions. Experimental results on multiple datasets
spanning multiple scripts show that the proposed adaptation method offers an
absolute improvement of up to 8% in character error rate with just a few
iterations of self-training at test time.
Related papers
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated captions reference, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z) - Debiasing Vison-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
To address the limitations, we propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z) - Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z) - LANCE: Stress-testing Visual Models by Generating Language-guided
Counterfactual Images [20.307968197151897]
We propose an automated algorithm to stress-test a trained visual model by generating language-guided counterfactual test images (LANCE)
Our method leverages recent progress in large language modeling and text-based image editing to augment an IID test set with a suite of diverse, realistic, and challenging test images without altering model weights.
arXiv Detail & Related papers (2023-05-30T16:09:16Z) - Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images are: (i) more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z) - WordStylist: Styled Verbatim Handwritten Text Generation with Latent
Diffusion Models [8.334487584550185]
We present a latent diffusion-based method for styled text-to-text-content-image generation on word-level.
Our proposed method is able to generate realistic word image samples from different writer styles.
We show that the proposed model produces samples that are aesthetically pleasing, help boosting text recognition performance, and get similar writer retrieval score as real data.
arXiv Detail & Related papers (2023-03-29T10:19:26Z) - Aligning Text-to-Image Models using Human Feedback [104.76638092169604]
Current text-to-image models often generate images that are inadequately aligned with text prompts.
We propose a fine-tuning method for aligning such models using human feedback.
Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
arXiv Detail & Related papers (2023-02-23T17:34:53Z) - ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust and user-intentional cropping algorithms.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z) - Thutmose Tagger: Single-pass neural model for Inverse Text Normalization [76.87664008338317]
Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition.
We present a dataset preparation method based on the granular alignment of ITN examples.
One-to-one correspondence between tags and input words improves the interpretability of the model's predictions.
arXiv Detail & Related papers (2022-07-29T20:39:02Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z) - Offline Handwritten Chinese Text Recognition with Convolutional Neural
Networks [5.984124397831814]
In this paper, we build the models using only the convolutional neural networks and use CTC as the loss function.
We achieve 6.81% character error rate (CER) on the ICDAR 2013 competition set, which is the best published result without language model correction.
arXiv Detail & Related papers (2020-06-28T14:34:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.