Multi-Granularity Prediction with Learnable Fusion for Scene Text
Recognition
- URL: http://arxiv.org/abs/2307.13244v1
- Date: Tue, 25 Jul 2023 04:12:50 GMT
- Title: Multi-Granularity Prediction with Learnable Fusion for Scene Text
Recognition
- Authors: Cheng Da, Peng Wang, Cong Yao
- Abstract summary: Scene text recognition (STR) has been an active research topic in computer vision for years.
To tackle this tough problem, numerous innovative methods have been proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend.
In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet functionally powerful vision STR model.
It already outperforms most previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods.
- Score: 20.48454415635795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the enormous technical challenges and wide range of applications,
scene text recognition (STR) has been an active research topic in computer
vision for years. To tackle this tough problem, numerous innovative methods
have been successively proposed, and incorporating linguistic knowledge into
STR models has recently become a prominent trend. In this work, we first draw
inspiration from the recent progress in Vision Transformer (ViT) to construct a
conceptually simple yet functionally powerful vision STR model, which is built
upon ViT and a tailored Adaptive Addressing and Aggregation (A$^3$) module. It
already outperforms most previous state-of-the-art models for scene text
recognition, including both pure vision models and language-augmented methods.
To integrate linguistic knowledge, we further propose a Multi-Granularity
Prediction strategy to inject information from the language modality into the
model in an implicit way, i.e., subword representations (BPE and WordPiece)
widely used in NLP are introduced into the output space, in addition to the
conventional character level representation, while no independent language
model (LM) is adopted. To produce the final recognition results, two strategies
for effectively fusing the multi-granularity predictions are devised. The
resultant algorithm (termed MGP-STR) is able to push the performance envelope
of STR to an even higher level. Specifically, MGP-STR achieves an average
recognition accuracy of $94\%$ on standard benchmarks for scene text
recognition. Moreover, it also achieves state-of-the-art results on widely-used
handwritten benchmarks as well as more challenging scene text datasets,
demonstrating the generality of the proposed MGP-STR algorithm. The source code
and models will be available at:
https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR.
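To make the abstract's architecture concrete, here is a minimal PyTorch sketch of the described pipeline: ViT patch tokens are pooled by an attention-style aggregation module, three parallel heads predict characters, BPE subwords, and WordPiece subwords, and a learnable gate weighs the branches when choosing a final transcription. All class and function names (AdaptiveAggregation, MultiGranularityHead, fuse_by_confidence), shapes, and the scalar-gate fusion are illustrative assumptions, not the released MGP-STR implementation.

```python
# Minimal sketch of multi-granularity prediction with learnable fusion.
# Names and shapes are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn


class AdaptiveAggregation(nn.Module):
    """Attention-style pooling of ViT patch tokens into per-position reading
    tokens (a loose stand-in for the paper's A^3 module)."""

    def __init__(self, feat_dim: int, max_len: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_len, feat_dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, feat_dim) from a ViT encoder.
        attn = torch.softmax(self.queries @ patch_tokens.transpose(1, 2), dim=-1)
        return attn @ patch_tokens  # (batch, max_len, feat_dim)


class MultiGranularityHead(nn.Module):
    """Character, BPE, and WordPiece classifiers over the aggregated tokens,
    plus one learnable scalar weight per granularity for fusion."""

    def __init__(self, feat_dim: int, char_vocab: int, bpe_vocab: int, wp_vocab: int):
        super().__init__()
        self.char_head = nn.Linear(feat_dim, char_vocab)
        self.bpe_head = nn.Linear(feat_dim, bpe_vocab)
        self.wp_head = nn.Linear(feat_dim, wp_vocab)
        self.fusion_logits = nn.Parameter(torch.zeros(3))

    def forward(self, tokens: torch.Tensor):
        logits = (self.char_head(tokens), self.bpe_head(tokens), self.wp_head(tokens))
        weights = torch.softmax(self.fusion_logits, dim=0)  # (3,)
        return logits, weights


def fuse_by_confidence(decoded_strings, branch_confidences, weights):
    """Pick the decoded string with the highest weighted branch confidence;
    a simple stand-in for the paper's fusion strategies."""
    scores = [w * c for w, c in zip(weights.tolist(), branch_confidences)]
    return decoded_strings[max(range(len(scores)), key=scores.__getitem__)]
```

One plausible way to use such a module: during training, each head is supervised with its own tokenization of the ground-truth text (character, BPE, WordPiece), so linguistic structure enters the output space without a separate language model; at inference, each branch is decoded independently and the fusion step selects a single transcription.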
Related papers
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z) - Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer [32.657218195756414]
Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc.
We introduce E$^2$STR, an STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy.
E$^2$STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks.
arXiv Detail & Related papers (2023-11-22T02:46:57Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - Multi-Granularity Prediction for Scene Text Recognition [20.48454415635795]
Scene text recognition (STR) has been an active research topic in computer vision for years.
We first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model.
We propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way.
The resultant algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level.
arXiv Detail & Related papers (2022-09-08T06:43:59Z) - Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z) - Multimodal Conditionality for Natural Language Generation [0.0]
MAnTiS is a general approach for multimodal conditionality in transformer-based Natural Language Generation models.
We apply MAnTiS to the task of product description generation, conditioning a network on both product images and titles to generate descriptive text.
arXiv Detail & Related papers (2021-09-02T22:06:07Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z) - Progressive Generation of Long Text with Pretrained Language Models [83.62523163717448]
Large-scale language models (LMs) pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators.
It is still challenging for such models to generate coherent long passages of text, especially when the models are fine-tuned to the target domain on a small corpus.
We propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution.
arXiv Detail & Related papers (2020-06-28T21:23:05Z)