Adaptive Text Recognition through Visual Matching
- URL: http://arxiv.org/abs/2009.06610v1
- Date: Mon, 14 Sep 2020 17:48:53 GMT
- Title: Adaptive Text Recognition through Visual Matching
- Authors: Chuhan Zhang, Ankush Gupta, Andrew Zisserman
- Abstract summary: We introduce a new model that exploits the repetitive nature of characters in languages.
By doing this, we turn text recognition into a shape matching problem.
We show that it can handle challenges that traditional architectures are not able to solve without expensive retraining.
- Score: 86.40870804449737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, our objective is to address the problems of generalization and
flexibility for text recognition in documents. We introduce a new model that
exploits the repetitive nature of characters in languages, and decouples the
visual representation learning and linguistic modelling stages. By doing this,
we turn text recognition into a shape matching problem, and thereby achieve
generalization in appearance and flexibility in classes. We evaluate the new
model on both synthetic and real datasets across different alphabets and show
that it can handle challenges that traditional architectures are not able to
solve without expensive retraining, including: (i) it can generalize to unseen
fonts without new exemplars from them; (ii) it can flexibly change the number
of classes, simply by changing the exemplars provided; and (iii) it can
generalize to new languages and new characters that it has not been trained for
by providing a new glyph set. We show significant improvements over
state-of-the-art models for all these cases.
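A minimal sketch of the matching stage this abstract describes, assuming a generic shared CNN encoder; the toy network, shapes, and per-glyph pooling are illustrative stand-ins, not the authors' architecture, and the paper's separate linguistic decoding stage is omitted:
```python
import torch
import torch.nn.functional as F

# Shared visual encoder (toy stand-in): the text-line image and the glyph
# exemplar strip pass through the same network, so recognition reduces to
# comparing their features, i.e. shape matching.
encoder = torch.nn.Sequential(
    torch.nn.Conv2d(1, 32, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d((1, None)),    # collapse height, keep width
)

def encode(img):
    feats = encoder(img)                      # (B, C, 1, W)
    feats = feats.squeeze(2).permute(0, 2, 1) # (B, W, C)
    return F.normalize(feats, dim=-1)         # unit norm for cosine similarity

line = torch.randn(1, 1, 32, 256)             # text-line image
exemplars = torch.randn(1, 1, 32, 26 * 16)    # 26 glyphs, 16 px wide each

# Cosine-similarity map between line positions and exemplar columns.
sim = encode(line) @ encode(exemplars).transpose(1, 2)  # (1, 256, 416)
# Pool the 16 columns of each glyph into one score per class, then pick the
# best-matching class at every horizontal position of the line.
scores = sim.view(1, sim.size(1), 26, 16).max(dim=-1).values
pred = scores.argmax(dim=-1)                  # per-position class indices
print(pred.shape)                             # torch.Size([1, 256])
```
Swapping the exemplar strip changes the alphabet without retraining, which is exactly the flexibility that points (ii) and (iii) above rely on.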
Related papers
- Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings [5.257719744958367]
This thesis explores three challenging settings in text classification by leveraging the intrinsic knowledge of pretrained language models (PLMs).
We develop models that utilize features based on contextualized word representations from PLMs, achieving performance that rivals or surpasses human accuracy.
Lastly, we tackle the sensitivity of large language models to in-context learning prompts by selecting effective demonstrations.
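The recipe this summary names, features from a pretrained LM plus a light classifier, can be illustrated as below; the model choice, mean pooling, and logistic-regression head are assumptions of the example, not the thesis's exact setup:
```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
plm = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = plm(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean pooling

# Train a light classifier on top of the frozen contextualized features.
X = embed(["great movie", "terrible plot", "loved it", "awful acting"])
clf = LogisticRegression().fit(X, [1, 0, 1, 0])
print(clf.predict(embed(["what a fantastic film"])))
```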
arXiv Detail & Related papers (2024-08-28T09:07:30Z)
- We're Calling an Intervention: Exploring the Fundamental Hurdles in Adapting Language Models to Nonstandard Text [8.956635443376527]
We present a suite of experiments that allow us to understand the underlying challenges of language model adaptation to nonstandard text.
We do so by designing interventions that approximate several types of linguistic variation and their interactions with existing biases of language models.
Applying our interventions during language model adaptation with varying size and nature of training data, we gain important insights into when knowledge transfer can be successful.
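One way to picture such an intervention, hedged as an assumed stand-in rather than the paper's actual design, is a controllable character-level corruption that approximates typo-like or dialectal spelling variation:
```python
import random

def char_noise(text: str, rate: float, seed: int = 0) -> str:
    """Randomly drop, double, or case-flip characters at the given rate."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < rate:
            out.append(rng.choice(["", ch * 2, ch.swapcase()]))
        else:
            out.append(ch)
    return "".join(out)

# Varying `rate` during adaptation lets one study when knowledge transfers.
print(char_noise("language models struggle with nonstandard text", 0.15))
```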
arXiv Detail & Related papers (2024-04-10T18:56:53Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Text-driven Prompt Generation for Vision-Language Models in Federated Learning [24.005620820818756]
Our work proposes Federated Text-driven Prompt Generation (FedTPG).
FedTPG learns a unified prompt generation network across multiple remote clients in a scalable manner.
Our comprehensive empirical evaluations on nine diverse image classification datasets show that our method is superior to existing federated prompt learning methods.
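A rough sketch of the two ingredients named above, a network that generates soft prompt vectors from task-related text embeddings and a FedAvg-style aggregation of its weights across clients; the architecture and dimensions are assumptions, not the FedTPG code:
```python
import torch

class PromptGenerator(torch.nn.Module):
    """Maps task-related text embeddings to soft prompt tokens."""
    def __init__(self, text_dim=512, n_ctx=4):
        super().__init__()
        self.net = torch.nn.Linear(text_dim, n_ctx * text_dim)
        self.n_ctx, self.dim = n_ctx, text_dim

    def forward(self, class_text_emb):              # (n_classes, text_dim)
        ctx = self.net(class_text_emb.mean(0))      # condition on the task
        return ctx.view(self.n_ctx, self.dim)       # soft prompt tokens

def fedavg(client_models):
    """Average the state dicts of per-client prompt generators."""
    avg = {k: torch.stack([m.state_dict()[k] for m in client_models]).mean(0)
           for k in client_models[0].state_dict()}
    for m in client_models:
        m.load_state_dict(avg)

clients = [PromptGenerator() for _ in range(3)]
fedavg(clients)                                     # one aggregation round
print(clients[0](torch.randn(10, 512)).shape)       # torch.Size([4, 512])
```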
arXiv Detail & Related papers (2023-10-09T19:57:24Z)
- Learning to Name Classes for Vision and Language Models [57.0059455405424]
Large-scale vision and language models can achieve impressive zero-shot recognition performance by mapping class-specific text queries to image content.
We propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content.
By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names.
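A minimal sketch of that idea, keeping the model frozen and optimizing only one embedding per class; the stub encoder and loss stand in for a CLIP-style pipeline and are assumptions of this example:
```python
import torch
import torch.nn.functional as F

n_classes, dim = 5, 128
class_emb = torch.nn.Parameter(torch.randn(n_classes, dim) * 0.02)
opt = torch.optim.Adam([class_emb], lr=1e-2)   # only the embeddings train

def frozen_image_encoder(images):              # stand-in for a CLIP tower
    with torch.no_grad():
        return F.normalize(images, dim=-1)

images = torch.randn(32, dim)                  # pretend image features
labels = torch.randint(0, n_classes, (32,))
for _ in range(100):
    logits = frozen_image_encoder(images) @ class_emb.t()
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))                             # typically drops steadily
```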
arXiv Detail & Related papers (2023-04-04T14:34:44Z)
- Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models.
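The OCR-supplement pattern mentioned above can be sketched as follows; pytesseract is one concrete OCR choice assumed here, and `captioner` is a hypothetical placeholder for whatever captioning model is used:
```python
from PIL import Image
import pytesseract

def caption_with_scene_text(image_path, captioner):
    """Feed OCR-extracted scene text to the captioner alongside the image."""
    image = Image.open(image_path)
    scene_text = pytesseract.image_to_string(image).strip()
    prompt = f"Scene text: {scene_text}" if scene_text else ""
    return captioner(image, prompt)

# caption_with_scene_text("shopfront.jpg", my_captioner)  # usage sketch
```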
arXiv Detail & Related papers (2022-07-09T01:56:35Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Towards Open-Set Text Recognition via Label-to-Prototype Learning [18.06730376866086]
We propose a label-to-prototype learning framework to handle novel characters without retraining the model.
Extensive experiments show that our method achieves promising performance on a variety of zero-shot, close-set, and open-set text recognition datasets.
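The label-to-prototype idea can be pictured with a small sketch: a learned mapper turns a label-side embedding into a classifier prototype, so a novel character is added by adding a row rather than retraining; the shapes and the linear mapper are illustrative assumptions:
```python
import torch

label_dim, feat_dim = 64, 256
mapper = torch.nn.Linear(label_dim, feat_dim)   # label embedding -> prototype

known_labels = torch.randn(100, label_dim)      # characters seen in training
novel_labels = torch.randn(7, label_dim)        # characters added afterwards

# Prototypes for old and new classes come from the same frozen mapper.
prototypes = mapper(torch.cat([known_labels, novel_labels]))  # (107, 256)
features = torch.randn(4, feat_dim)             # image-side features
pred = (features @ prototypes.t()).argmax(-1)   # open-set classification
print(pred)                                     # indices may exceed 99
```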
arXiv Detail & Related papers (2022-03-10T06:22:51Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
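A toy version of RAVEN's core question, what fraction of generated n-grams already occur verbatim in the training corpus, is shown below; the real suite spans many n-gram sizes and syntactic analyses, so this duplication probe is only the simplest piece:
```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated: str, corpus: str, n: int = 4) -> float:
    """Share of generated n-grams that never appear in the corpus."""
    gen, train = ngrams(generated.split(), n), ngrams(corpus.split(), n)
    return 1 - len(gen & train) / max(len(gen), 1)

corpus = "the cat sat on the mat and the dog slept by the door"
print(novelty("the cat sat on the rug", corpus))  # ~0.33: two 4-grams copied
```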
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Scalable Font Reconstruction with Dual Latent Manifolds [55.29525824849242]
We propose a deep generative model that performs typography analysis and font reconstruction.
Our approach enables us to massively scale up the number of character types we can effectively model.
We evaluate on the task of font reconstruction over various datasets representing character types of many languages.
arXiv Detail & Related papers (2021-09-10T20:37:43Z)
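The dual-latent idea can be sketched as below: every glyph is decoded from a style (font) code and a content (character) code, so new combinations come from recombining learned codes; the decoder and dimensions are assumptions, not the paper's model:
```python
import torch

n_fonts, n_chars, z_dim = 50, 1000, 64
font_codes = torch.nn.Embedding(n_fonts, z_dim)   # style manifold
char_codes = torch.nn.Embedding(n_chars, z_dim)   # content manifold
decoder = torch.nn.Sequential(
    torch.nn.Linear(2 * z_dim, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 32 * 32),                # flat 32x32 glyph image
)

def render(font_id, char_id):
    z = torch.cat([font_codes(torch.tensor(font_id)),
                   char_codes(torch.tensor(char_id))])
    return decoder(z).view(32, 32)

print(render(3, 42).shape)                        # torch.Size([32, 32])
```
Scaling to many character types then means growing the char_codes table, not the decoder, which matches the scalability claim above.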