Multi-Granularity Prediction for Scene Text Recognition
- URL: http://arxiv.org/abs/2209.03592v1
- Date: Thu, 8 Sep 2022 06:43:59 GMT
- Title: Multi-Granularity Prediction for Scene Text Recognition
- Authors: Peng Wang, Cheng Da, Cong Yao
- Abstract summary: Scene text recognition (STR) has been an active research topic in computer vision for years.
We first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model.
We propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way.
The resultant algorithm (termed MGP-STR) is able to push the performance envelop of STR to an even higher level.
- Score: 20.48454415635795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition (STR) has been an active research topic in computer
vision for years. To tackle this challenging problem, numerous innovative
methods have been successively proposed and incorporating linguistic knowledge
into STR models has recently become a prominent trend. In this work, we first
draw inspiration from the recent progress in Vision Transformer (ViT) to
construct a conceptually simple yet powerful vision STR model, which is built
upon ViT and outperforms previous state-of-the-art models for scene text
recognition, including both pure vision models and language-augmented methods.
To integrate linguistic knowledge, we further propose a Multi-Granularity
Prediction strategy to inject information from the language modality into the
model in an implicit way, i.e. , subword representations (BPE and WordPiece)
widely-used in NLP are introduced into the output space, in addition to the
conventional character level representation, while no independent language
model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push
the performance envelop of STR to an even higher level. Specifically, it
achieves an average recognition accuracy of 93.35% on standard benchmarks. Code
will be released soon.
Related papers
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose bfAnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter [21.45490901191175]
PaLM2-VAdapter employs a progressively aligned language model as the vision-language adapter.
Our method achieves these advancements with 3070% fewer parameters than the state-of-the-art large vision-language models.
arXiv Detail & Related papers (2024-02-16T18:54:47Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - ViLTA: Enhancing Vision-Language Pre-training through Textual
Augmentation [35.05755930636518]
We propose ViLTA, comprising of two components to further facilitate the model to learn fine-grained representations among image-text pairs.
For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels to enhance the robustness of model.
For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of language input.
arXiv Detail & Related papers (2023-08-31T12:46:36Z) - Multi-Granularity Prediction with Learnable Fusion for Scene Text
Recognition [20.48454415635795]
Scene text recognition (STR) has been an active research topic in computer vision for years.
To tackle this tough problem, numerous innovative methods have been proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend.
In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet functionally powerful vision STR model.
It already outperforms most previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods.
arXiv Detail & Related papers (2023-07-25T04:12:50Z) - ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for
Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z) - Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z) - Multimodal Conditionality for Natural Language Generation [0.0]
MAnTiS is a general approach for multimodal conditionality in transformer-based Natural Language Generation models.
We apply MAnTiS to the task of product description generation, conditioning a network on both product images and titles to generate descriptive text.
arXiv Detail & Related papers (2021-09-02T22:06:07Z) - ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and
Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z) - SLM: Learning a Discourse Language Representation with Sentence
Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.