Self-supervised Implicit Glyph Attention for Text Recognition
- URL: http://arxiv.org/abs/2203.03382v4
- Date: Mon, 15 May 2023 09:58:38 GMT
- Title: Self-supervised Implicit Glyph Attention for Text Recognition
- Authors: Tongkun Guan, Chaochen Gu, Jingzheng Tu, Xue Yang, Qi Feng, Yudi Zhao,
Xiaokang Yang, Wei Shen
- Abstract summary: We propose a novel attention mechanism for scene text recognition (STR), self-supervised implicit glyph attention (SIGA).
SIGA delineates the glyph structures of text images through joint self-supervised text segmentation and implicit attention alignment.
Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods.
- Score: 52.68772018871633
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The attention mechanism has become the \emph{de facto} module in scene text recognition (STR) methods, due to its capability of extracting character-level representations. These methods can be divided into implicit-attention-based and supervised-attention-based, depending on how the attention is computed: implicit attention is learned from sequence-level text annotations, whereas supervised attention is learned from character-level bounding box annotations. Implicit attention may extract coarse or even incorrect spatial regions as character attention, and is therefore prone to alignment drift. Supervised attention can alleviate this issue, but it is character-category-specific: it requires extra, laborious character-level bounding box annotations and becomes memory-intensive when handling languages with large character sets. To address these issues, we propose a novel attention mechanism for STR, self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images by jointly performing self-supervised text segmentation and implicit attention alignment, which serve as supervision to improve attention correctness without extra character-level annotations. Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods, in terms of both attention correctness and final recognition performance, on publicly available context benchmarks and on our contributed contextless benchmarks.
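The core mechanism the abstract describes, using self-supervised glyph segmentation as supervision for implicit attention, can be illustrated with a short sketch. This is not the authors' code: the soft-dice alignment term, the tensor shapes, and every name below are assumptions made for illustration.

```python
import torch

def glyph_attention_loss(attn_maps, glyph_masks, eps=1e-6):
    """Illustrative alignment loss: encourage each decoding step's
    attention map to overlap the pseudo glyph mask for that character.

    attn_maps:   (B, T, H, W) softmax attention maps from the decoder.
    glyph_masks: (B, T, H, W) pseudo masks from a self-supervised
                 text segmentation branch (no manual annotations).
    """
    attn = attn_maps.flatten(2)    # (B, T, H*W)
    mask = glyph_masks.flatten(2)  # (B, T, H*W)
    inter = (attn * mask).sum(-1)
    union = attn.sum(-1) + mask.sum(-1)
    dice = (2 * inter + eps) / (union + eps)
    return (1 - dice).mean()       # 0 when attention matches the glyphs

# Hypothetical usage: total = ce_loss + lam * glyph_attention_loss(a, m)
```

The point of the construction is that `glyph_masks` come from self-supervision rather than character-level bounding boxes, so the attention supervision is annotation-free.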
Related papers
- Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition [55.97779732051921]
State-of-the-art classifiers for facial expression recognition (FER) lack interpretability, an important feature for end-users.
A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, enabling the training of deep interpretable models.
Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time.
arXiv Detail & Related papers (2024-10-01T10:42:55Z) - Attention Guidance Mechanism for Handwritten Mathematical Expression
Recognition [20.67011291281534]
Handwritten mathematical expression recognition (HMER) is challenging in image-to-text tasks due to the complex layouts of mathematical expressions.
We propose an attention guidance mechanism to explicitly suppress attention weights in irrelevant areas and enhance the appropriate ones (a sketch follows this entry).
Our method outperforms existing state-of-the-art methods, achieving expression recognition rates of 60.75% / 61.81% / 63.30% on the CROHME 2014/2016/2019 datasets.
arXiv Detail & Related papers (2024-03-04T06:22:17Z) - Weakly-Supervised Text Instance Segmentation [44.20745377169349]
- Weakly-Supervised Text Instance Segmentation [44.20745377169349]
We make the first attempt to perform weakly-supervised text instance segmentation by bridging text recognition and text segmentation.
The proposed method significantly outperforms weakly-supervised instance segmentation methods on the ICDAR13-FST (18.95% improvement) and TextSeg (17.80% improvement) benchmarks.
arXiv Detail & Related papers (2023-03-20T03:56:47Z) - Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z) - On the Locality of Attention in Direct Speech Translation [0.1749935196721634]
Transformers have achieved state-of-the-art results across multiple NLP tasks.
We discuss the usefulness of self-attention for Direct Speech Translation.
arXiv Detail & Related papers (2022-04-19T17:43:37Z) - Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention, which explicitly encourages self-attention to match the distributions of the key and query within each head (a sketch follows this entry).
It is simple to convert any model with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z) - PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering
- PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network [54.03560668182197]
We propose a novel fully convolutional Point Gathering Network (PGNet) for reading arbitrarily-shaped text in real-time.
With a PG-CTC decoder, we gather high-level character classification vectors from two-dimensional space and decode them into text symbols without NMS or RoI operations (a sketch follows this entry).
Experiments show that the proposed method achieves competitive accuracy while significantly improving running speed.
arXiv Detail & Related papers (2021-04-12T13:27:34Z) - MANGO: A Mask Attention Guided One-Stage Scene Text Spotter [41.66707532607276]
- MANGO: A Mask Attention Guided One-Stage Scene Text Spotter [41.66707532607276]
We propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO.
The proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks.
arXiv Detail & Related papers (2020-12-08T10:47:49Z) - Boost Image Captioning with Knowledge Reasoning [10.733743535624509]
We propose word attention to improve the correctness of visual attention when generating sequential descriptions word-by-word.
We introduce a new strategy to inject external knowledge, extracted from a knowledge graph, into the encoder-decoder framework to facilitate meaningful captioning.
arXiv Detail & Related papers (2020-11-02T12:19:46Z) - Salience Estimation with Multi-Attention Learning for Abstractive Text
Summarization [86.45110800123216]
In the task of text summarization, salience estimation for words, phrases or sentences is a critical component.
We propose a Multi-Attention Learning framework which contains two new attention learning components for salience estimation.
arXiv Detail & Related papers (2020-04-07T02:38:56Z)