Adaptive Text Recognition through Visual Matching
- URL: http://arxiv.org/abs/2009.06610v1
- Date: Mon, 14 Sep 2020 17:48:53 GMT
- Title: Adaptive Text Recognition through Visual Matching
- Authors: Chuhan Zhang, Ankush Gupta, Andrew Zisserman
- Abstract summary: We introduce a new model that exploits the repetitive nature of characters in languages.
By doing this, we turn text recognition into a shape matching problem.
We show that it can handle challenges that traditional architectures are not able to solve without expensive retraining.
- Score: 86.40870804449737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, our objective is to address the problems of generalization and
flexibility for text recognition in documents. We introduce a new model that
exploits the repetitive nature of characters in languages, and decouples the
visual representation learning and linguistic modelling stages. By doing this,
we turn text recognition into a shape matching problem, and thereby achieve
generalization in appearance and flexibility in classes. We evaluate the new
model on both synthetic and real datasets across different alphabets and show
that it can handle challenges that traditional architectures are not able to
solve without expensive retraining, including: (i) it can generalize to unseen
fonts without new exemplars from them; (ii) it can flexibly change the number
of classes, simply by changing the exemplars provided; and (iii) it can
generalize to new languages and new characters that it has not been trained for
by providing a new glyph set. We show significant improvements over
state-of-the-art models for all these cases.
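A minimal sketch of the matching stage this abstract describes, assuming a generic shared CNN encoder; the toy network, shapes, and per-glyph pooling are illustrative stand-ins, not the authors' architecture, and the paper's separate linguistic decoding stage is omitted:
```python
import torch
import torch.nn.functional as F

# Shared visual encoder (toy stand-in): the text-line image and the glyph
# exemplar strip pass through the same network, so recognition reduces to
# comparing their features, i.e. shape matching.
encoder = torch.nn.Sequential(
    torch.nn.Conv2d(1, 32, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d((1, None)),    # collapse height, keep width
)

def encode(img):
    feats = encoder(img)                      # (B, C, 1, W)
    feats = feats.squeeze(2).permute(0, 2, 1) # (B, W, C)
    return F.normalize(feats, dim=-1)         # unit norm for cosine similarity

line = torch.randn(1, 1, 32, 256)             # text-line image
exemplars = torch.randn(1, 1, 32, 26 * 16)    # 26 glyphs, 16 px wide each

# Cosine-similarity map between line positions and exemplar columns.
sim = encode(line) @ encode(exemplars).transpose(1, 2)  # (1, 256, 416)
# Pool the 16 columns of each glyph into one score per class, then pick the
# best-matching class at every horizontal position of the line.
scores = sim.view(1, sim.size(1), 26, 16).max(dim=-1).values
pred = scores.argmax(dim=-1)                  # per-position class indices
print(pred.shape)                             # torch.Size([1, 256])
```
Swapping the exemplar strip changes the alphabet without retraining, which is exactly the flexibility that points (ii) and (iii) above rely on.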
Related papers
- Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings [5.257719744958367]
This thesis explores three challenging settings in text classification by leveraging the intrinsic knowledge of pretrained language models (PLMs).
We develop models that utilize features based on contextualized word representations from PLMs, achieving performance that rivals or surpasses human accuracy.
Lastly, we tackle the sensitivity of large language models to in-context learning prompts by selecting effective demonstrations.
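The recipe this summary names, features from a pretrained LM plus a light classifier, can be illustrated as below; the model choice, mean pooling, and logistic-regression head are assumptions of the example, not the thesis's exact setup:
```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
plm = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = plm(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean pooling

# Train a light classifier on top of the frozen contextualized features.
X = embed(["great movie", "terrible plot", "loved it", "awful acting"])
clf = LogisticRegression().fit(X, [1, 0, 1, 0])
print(clf.predict(embed(["what a fantastic film"])))
```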
arXiv Detail & Related papers (2024-08-28T09:07:30Z)
- We're Calling an Intervention: Exploring the Fundamental Hurdles in Adapting Language Models to Nonstandard Text [8.956635443376527]
We present a suite of experiments that allow us to understand the underlying challenges of language model adaptation to nonstandard text.
We do so by designing interventions that approximate several types of linguistic variation and their interactions with existing biases of language models.
Applying our interventions during language model adaptation with varying size and nature of training data, we gain important insights into when knowledge transfer can be successful.
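One way to picture such an intervention, hedged as an assumed stand-in rather than the paper's actual design, is a controllable character-level corruption that approximates typo-like or dialectal spelling variation:
```python
import random

def char_noise(text: str, rate: float, seed: int = 0) -> str:
    """Randomly drop, double, or case-flip characters at the given rate."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < rate:
            out.append(rng.choice(["", ch * 2, ch.swapcase()]))
        else:
            out.append(ch)
    return "".join(out)

# Varying `rate` during adaptation lets one study when knowledge transfers.
print(char_noise("language models struggle with nonstandard text", 0.15))
```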
arXiv Detail & Related papers (2024-04-10T18:56:53Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Text-driven Prompt Generation for Vision-Language Models in Federated Learning [24.005620820818756]
Our work proposes Federated Text-driven Prompt Generation (FedTPG).
FedTPG learns a unified prompt generation network across multiple remote clients in a scalable manner.
Our comprehensive empirical evaluations on nine diverse image classification datasets show that our method is superior to existing federated prompt learning methods.
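A rough sketch of the two ingredients named above, a network that generates soft prompt vectors from task-related text embeddings and a FedAvg-style aggregation of its weights across clients; the architecture and dimensions are assumptions, not the FedTPG code:
```python
import torch

class PromptGenerator(torch.nn.Module):
    """Maps task-related text embeddings to soft prompt tokens."""
    def __init__(self, text_dim=512, n_ctx=4):
        super().__init__()
        self.net = torch.nn.Linear(text_dim, n_ctx * text_dim)
        self.n_ctx, self.dim = n_ctx, text_dim

    def forward(self, class_text_emb):              # (n_classes, text_dim)
        ctx = self.net(class_text_emb.mean(0))      # condition on the task
        return ctx.view(self.n_ctx, self.dim)       # soft prompt tokens

def fedavg(client_models):
    """Average the state dicts of per-client prompt generators."""
    avg = {k: torch.stack([m.state_dict()[k] for m in client_models]).mean(0)
           for k in client_models[0].state_dict()}
    for m in client_models:
        m.load_state_dict(avg)

clients = [PromptGenerator() for _ in range(3)]
fedavg(clients)                                     # one aggregation round
print(clients[0](torch.randn(10, 512)).shape)       # torch.Size([4, 512])
```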
arXiv Detail & Related papers (2023-10-09T19:57:24Z)
- Learning to Name Classes for Vision and Language Models [57.0059455405424]
Large-scale vision and language models can achieve impressive zero-shot recognition performance by mapping class-specific text queries to image content.
We propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content.
By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names.
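A minimal sketch of that idea, keeping the model frozen and optimizing only one embedding per class; the stub encoder and loss stand in for a CLIP-style pipeline and are assumptions of this example:
```python
import torch
import torch.nn.functional as F

n_classes, dim = 5, 128
class_emb = torch.nn.Parameter(torch.randn(n_classes, dim) * 0.02)
opt = torch.optim.Adam([class_emb], lr=1e-2)   # only the embeddings train

def frozen_image_encoder(images):              # stand-in for a CLIP tower
    with torch.no_grad():
        return F.normalize(images, dim=-1)

images = torch.randn(32, dim)                  # pretend image features
labels = torch.randint(0, n_classes, (32,))
for _ in range(100):
    logits = frozen_image_encoder(images) @ class_emb.t()
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))                             # typically drops steadily
```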
arXiv Detail & Related papers (2023-04-04T14:34:44Z)
- Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models.
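The OCR-supplement pattern mentioned above can be sketched as follows; pytesseract is one concrete OCR choice assumed here, and `captioner` is a hypothetical placeholder for whatever captioning model is used:
```python
from PIL import Image
import pytesseract

def caption_with_scene_text(image_path, captioner):
    """Feed OCR-extracted scene text to the captioner alongside the image."""
    image = Image.open(image_path)
    scene_text = pytesseract.image_to_string(image).strip()
    prompt = f"Scene text: {scene_text}" if scene_text else ""
    return captioner(image, prompt)

# caption_with_scene_text("shopfront.jpg", my_captioner)  # usage sketch
```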
arXiv Detail & Related papers (2022-07-09T01:56:35Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Towards Open-Set Text Recognition via Label-to-Prototype Learning [18.06730376866086]
We propose a label-to-prototype learning framework to handle novel characters without retraining the model.
Extensive experiments show that our method achieves promising performance on a variety of zero-shot, close-set, and open-set text recognition datasets.
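The label-to-prototype idea can be pictured with a small sketch: a learned mapper turns a label-side embedding into a classifier prototype, so a novel character is added by adding a row rather than retraining; the shapes and the linear mapper are illustrative assumptions:
```python
import torch

label_dim, feat_dim = 64, 256
mapper = torch.nn.Linear(label_dim, feat_dim)   # label embedding -> prototype

known_labels = torch.randn(100, label_dim)      # characters seen in training
novel_labels = torch.randn(7, label_dim)        # characters added afterwards

# Prototypes for old and new classes come from the same frozen mapper.
prototypes = mapper(torch.cat([known_labels, novel_labels]))  # (107, 256)
features = torch.randn(4, feat_dim)             # image-side features
pred = (features @ prototypes.t()).argmax(-1)   # open-set classification
print(pred)                                     # indices may exceed 99
```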
arXiv Detail & Related papers (2022-03-10T06:22:51Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
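A toy version of RAVEN's core question, what fraction of generated n-grams already occur verbatim in the training corpus, is shown below; the real suite spans many n-gram sizes and syntactic analyses, so this duplication probe is only the simplest piece:
```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated: str, corpus: str, n: int = 4) -> float:
    """Share of generated n-grams that never appear in the corpus."""
    gen, train = ngrams(generated.split(), n), ngrams(corpus.split(), n)
    return 1 - len(gen & train) / max(len(gen), 1)

corpus = "the cat sat on the mat and the dog slept by the door"
print(novelty("the cat sat on the rug", corpus))  # ~0.33: two 4-grams copied
```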
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Scalable Font Reconstruction with Dual Latent Manifolds [55.29525824849242]
We propose a deep generative model that performs typography analysis and font reconstruction.
Our approach enables us to massively scale up the number of character types we can effectively model.
We evaluate on the task of font reconstruction over various datasets representing character types of many languages.
arXiv Detail & Related papers (2021-09-10T20:37:43Z)
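The dual-latent idea can be sketched as below: every glyph is decoded from a style (font) code and a content (character) code, so new combinations come from recombining learned codes; the decoder and dimensions are assumptions, not the paper's model:
```python
import torch

n_fonts, n_chars, z_dim = 50, 1000, 64
font_codes = torch.nn.Embedding(n_fonts, z_dim)   # style manifold
char_codes = torch.nn.Embedding(n_chars, z_dim)   # content manifold
decoder = torch.nn.Sequential(
    torch.nn.Linear(2 * z_dim, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 32 * 32),                # flat 32x32 glyph image
)

def render(font_id, char_id):
    z = torch.cat([font_codes(torch.tensor(font_id)),
                   char_codes(torch.tensor(char_id))])
    return decoder(z).view(32, 32)

print(render(3, 42).shape)                        # torch.Size([32, 32])
```
Scaling to many character types then means growing the char_codes table, not the decoder, which matches the scalability claim above.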