Masked and Permuted Implicit Context Learning for Scene Text Recognition
- URL: http://arxiv.org/abs/2305.16172v2
- Date: Wed, 20 Dec 2023 07:10:27 GMT
- Title: Masked and Permuted Implicit Context Learning for Scene Text Recognition
- Authors: Xiaomeng Yang, Zhi Qiao, Jin Wei, Dongbao Yang, Yu Zhou
- Abstract summary: Scene Text Recognition (STR) is difficult because of variations in text styles, shapes, and backgrounds.
We propose a masked and permuted implicit context learning network for STR that unifies PLM and MLM within a single decoder.
- Score: 8.742571493814326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene Text Recognition (STR) is difficult because of the variations in text
styles, shapes, and backgrounds. Though the integration of linguistic
information enhances models' performance, existing methods based on either
permuted language modeling (PLM) or masked language modeling (MLM) have their
pitfalls. PLM's autoregressive decoding lacks foresight into subsequent
characters, while MLM overlooks inter-character dependencies. Addressing these
problems, we propose a masked and permuted implicit context learning network
for STR, which unifies PLM and MLM within a single decoder, inheriting the
advantages of both approaches. We utilize the training procedure of PLM, and to
integrate MLM, we incorporate word length information into the decoding process
and replace the undetermined characters with mask tokens. In addition,
perturbation training is employed to make the model more robust to potential
length prediction errors. Our empirical evaluations demonstrate the
effectiveness of our model: it not only achieves superior performance on the
common benchmarks but also yields a substantial improvement of $9.1\%$ on the
more challenging Union14M-Benchmark.
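A minimal sketch of the decoding scheme described in the abstract, assuming toy stand-ins for the model components (predict_length, predict_char, and the MASK string are illustrative names, not the authors' API): the predicted word length fixes a canvas of mask tokens, and characters are filled in a permuted order so each prediction conditions on every character decided so far while the rest remain masked.

```python
# Illustrative sketch only: predict_length, predict_char, and MASK are toy
# stand-ins, not the authors' implementation.
import random

MASK = "[MASK]"

def predict_length(image_feature):
    # Stand-in for the model's word-length prediction head.
    return len(image_feature)

def predict_char(image_feature, context, position):
    # Stand-in for the decoder; here it simply reads off the ground truth.
    return image_feature[position]

def masked_permuted_decode(image_feature, seed=0):
    length = predict_length(image_feature)      # length information (MLM side)
    context = [MASK] * length                   # undetermined characters start as masks
    order = list(range(length))
    random.Random(seed).shuffle(order)          # permuted decoding order (PLM side)
    for pos in order:
        # Each prediction sees the full-length context: already decoded
        # characters plus [MASK] placeholders for the rest.
        context[pos] = predict_char(image_feature, context, pos)
    return "".join(context)

print(masked_permuted_decode("street"))  # -> street
```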
Related papers
- PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures [5.513631883813244]
We propose a framework that Pre-Integrates Prompt information into the visual encoding process using existing modules of MLLMs.
Our model maintains excellent generation quality even when half of the visual tokens are removed.
arXiv Detail & Related papers (2024-10-30T15:05:17Z) - ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual referring capability into Multimodal Large Language Models (MLLMs).
We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them.
We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map.
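A hypothetical sketch of the training-free idea summarized above: a learnable visual offset is optimized against an energy that concentrates text-to-visual attention on a referred region. The energy function, tensor shapes, and the single shared offset are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch only: energy, shapes, and the learnable offset are assumptions.
import torch

torch.manual_seed(0)
n_text, n_vis, dim = 4, 16, 32
text_tokens = torch.randn(n_text, dim)       # text prompt token features
visual_tokens = torch.randn(n_vis, dim)      # visual token features
region_mask = torch.zeros(n_vis)
region_mask[5:9] = 1.0                       # visual tokens covering the referred region

prompt = torch.zeros(dim, requires_grad=True)        # learnable visual offset
optimizer = torch.optim.SGD([prompt], lr=0.1)

for _ in range(50):
    attn = torch.softmax(text_tokens @ (visual_tokens + prompt).T / dim ** 0.5, dim=-1)
    energy = -(attn * region_mask).sum()     # lower energy = more attention on the region
    optimizer.zero_grad()
    energy.backward()
    optimizer.step()

print(float((attn * region_mask).sum()))     # attention mass on the referred region grows
```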
arXiv Detail & Related papers (2024-07-31T11:40:29Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens [21.61634020256455]
Transformer-based large language models (LLMs) suffer a performance degradation when modeling long-term contexts.
We propose a simple yet effective method to enable LLMs to take a deep breath, encouraging them to summarize information contained within discrete text chunks.
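A toy sketch of the chunk-plus-sentinel idea suggested by the title and summary above; the sentinel string, the character-level chunking, and the placement of one sentinel per chunk are assumptions for illustration only.

```python
# Illustrative sketch only: the sentinel string and chunking rule are assumptions.
def insert_sentinels(text, chunk_size=50, sentinel="<SENTINEL>"):
    # Split the input into fixed-size chunks and append a sentinel token after
    # each one, giving the model a slot to summarize that chunk into.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return f" {sentinel} ".join(chunks) + f" {sentinel}"

print(insert_sentinels("Transformer-based LLMs degrade on long contexts. " * 3))
```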
arXiv Detail & Related papers (2024-06-16T15:50:10Z) - Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce SyntaxEval, a technique that uses Syntactic Capabilities to evaluate the models.
arXiv Detail & Related papers (2024-01-03T02:44:02Z) - LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the SotA when the target text is large.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z) - Representation Deficiency in Masked Language Modeling [107.39136254013042]
We propose MAE-LM, which pretrains the Masked Autoencoder architecture, where $\texttt{[MASK]}$ tokens are excluded from the encoder.
We show that MAE-LM consistently outperforms pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
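A small, hypothetical sketch of the data flow summarized above: the encoder sees only the kept tokens, and mask embeddings are reinserted at the excluded positions only for the later prediction stage. Module choices and shapes are illustrative, not MAE-LM's actual configuration.

```python
# Illustrative sketch only: module choices and shapes are assumptions.
import torch

torch.manual_seed(0)
seq_len, dim, mask_ratio = 12, 32, 0.25
tokens = torch.randn(seq_len, dim)                   # embedded input tokens
n_mask = int(seq_len * mask_ratio)
perm = torch.randperm(seq_len)
masked_idx, kept_idx = perm[:n_mask], perm[n_mask:]

encoder = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
mask_embedding = torch.zeros(dim)                    # stands in for the [MASK] embedding

# The encoder never sees [MASK] tokens: only the kept positions are encoded.
encoded_kept = encoder(tokens[kept_idx].unsqueeze(0)).squeeze(0)

# Mask embeddings are reinserted only afterwards, for the prediction stage.
full = torch.empty(seq_len, dim)
full[kept_idx] = encoded_kept
full[masked_idx] = mask_embedding
print(full.shape)                                    # the prediction head would run on `full`
```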
arXiv Detail & Related papers (2023-02-04T01:54:17Z) - Learning Better Masking for Better Language Model Pre-training [80.31112722910787]
Masked Language Modeling (MLM) has been widely used as a denoising objective in pre-training language models (PrLMs).
PrLMs commonly adopt a Random-Token Masking strategy where a fixed masking ratio is applied and different contents are masked with equal probability throughout the entire training.
We propose two scheduled masking approaches that adaptively tune the masking ratio and masked content in different training stages.
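A minimal sketch of a scheduled masking ratio, assuming a simple linear decay across training steps; the schedule shape and its endpoints are illustrative and not the values proposed in the paper.

```python
# Illustrative sketch only: the linear decay and its endpoints are assumptions.
def scheduled_mask_ratio(step, total_steps, start=0.30, end=0.15):
    # Anneal the masking ratio from `start` to `end` as training progresses,
    # instead of keeping it fixed as in Random-Token Masking.
    progress = min(step / total_steps, 1.0)
    return start + (end - start) * progress

for step in (0, 5000, 10000):
    print(step, round(scheduled_mask_ratio(step, 10000), 3))
```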
arXiv Detail & Related papers (2022-08-23T08:27:52Z) - UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)