RVAFM: Re-parameterizing Vertical Attention Fusion Module for Handwritten Paragraph Text Recognition
- URL: http://arxiv.org/abs/2503.03104v1
- Date: Wed, 05 Mar 2025 01:41:59 GMT
- Title: RVAFM: Re-parameterizing Vertical Attention Fusion Module for Handwritten Paragraph Text Recognition
- Authors: Jinhui Zheng, Zhiquan Liu, Yain-Whar Si, Jianqing Li, Xinyuan Zhang, Xiaofan Li, Haozhi Huang, Xueyuan Gong
- Abstract summary: We propose a new module, named Re-parameterizing Vertical Attention Fusion Module (RVAFM). RVAFM decouples the structure of the module during the training and inference stages. We achieve a Character Error Rate (CER) of 4.44% and a Word Error Rate (WER) of 14.37% on the IAM paragraph-level test set.
- Score: 7.559954455228787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Handwritten Paragraph Text Recognition (HPTR) is a challenging task in Computer Vision, requiring the transformation of a paragraph text image, rich in handwritten text, into text encoding sequences. One of the most advanced models for this task is the Vertical Attention Network (VAN), which utilizes a Vertical Attention Module (VAM) to implicitly segment paragraph text images into text lines, thereby reducing the difficulty of the recognition task. However, from a network-structure perspective, VAM is a single-branch module, which learns less effectively than multi-branch modules. In this paper, we propose a new module, named Re-parameterizing Vertical Attention Fusion Module (RVAFM), which incorporates structural re-parameterization techniques. RVAFM decouples the structure of the module between the training and inference stages. During training, it uses a multi-branch structure for more effective learning; during inference, it uses a single-branch structure for faster processing. The features learned by the multi-branch structure are fused into the single-branch structure through a special fusion method named Re-parameterization Fusion (RF) without any loss of information. As a result, we achieve a Character Error Rate (CER) of 4.44% and a Word Error Rate (WER) of 14.37% on the IAM paragraph-level test set. Additionally, the inference speed is slightly higher than that of VAN.
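The abstract gives no implementation details, but the general mechanics of structural re-parameterization are well established (e.g., RepVGG-style blocks). The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' RVAFM or RF method: because convolution is linear in its weights, parallel branches trained jointly can be folded into a single equivalent branch for inference.

```python
import torch
import torch.nn as nn

class ReparamBlock(nn.Module):
    """Multi-branch during training, fusable to a single conv for inference.

    Hypothetical illustration of structural re-parameterization (not the
    paper's RVAFM): two parallel 3x3 convolutions are summed at training
    time and merged into one 3x3 convolution afterwards, which is exact
    because convolution is linear in its weights and biases.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.branch_a = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch_b = nn.Conv2d(channels, channels, 3, padding=1)
        self.fused = None  # set by fuse()

    def forward(self, x):
        if self.fused is not None:                   # single-branch inference
            return self.fused(x)
        return self.branch_a(x) + self.branch_b(x)   # multi-branch training

    @torch.no_grad()
    def fuse(self):
        """Fold both branches into one conv without changing the output."""
        fused = nn.Conv2d(self.branch_a.in_channels,
                          self.branch_a.out_channels, 3, padding=1)
        fused.weight.copy_(self.branch_a.weight + self.branch_b.weight)
        fused.bias.copy_(self.branch_a.bias + self.branch_b.bias)
        self.fused = fused

# Sanity check: outputs before and after fusion agree.
block = ReparamBlock(8).eval()
x = torch.randn(1, 8, 32, 32)
y_multi = block(x)
block.fuse()
assert torch.allclose(y_multi, block(x), atol=1e-5)
```

This equivalence is what lets a re-parameterized module keep the training-time benefit of the multi-branch structure while paying only the single-branch inference cost.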
Related papers
- RFL: Simplifying Chemical Structure Recognition with Ring-Free Language [66.47173094346115]
We propose a novel Ring-Free Language (RFL) to describe chemical structures in a hierarchical form.
RFL allows complex molecular structures to be decomposed into multiple parts, ensuring both uniqueness and conciseness.
We propose a universal Molecular Skeleton Decoder (MSD), which comprises a skeleton generation module that progressively predicts the molecular skeleton and individual rings.
arXiv Detail & Related papers (2024-12-10T15:29:32Z)
- Leveraging Structure Knowledge and Deep Models for the Detection of Abnormal Handwritten Text [19.05500901000957]
We propose a two-stage detection algorithm that combines structure knowledge and deep models for handwritten text.
A shape regression network, trained with a novel semi-supervised contrastive training strategy, is introduced, and the positional relationships between characters are fully exploited.
Experiments on two handwritten text datasets show that the proposed method can greatly improve the detection performance.
arXiv Detail & Related papers (2024-10-15T14:57:10Z)
- Out of Length Text Recognition with Sub-String Matching [54.63761108308825]
In this paper, we term this task Out of Length (OOL) text recognition.
We propose a novel method called OOL Text Recognition with sub-String Matching (SMTR).
SMTR comprises two cross-attention-based modules: one encodes a sub-string containing multiple characters into next and previous queries, and the other employs the queries to attend to the image features (a minimal sketch of this pattern follows this entry).
arXiv Detail & Related papers (2024-07-17T05:02:17Z)
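SMTR's code is not shown here; the snippet below is only a minimal, hypothetical sketch of the query-over-image-features pattern the summary describes, built from a standard PyTorch multi-head attention layer. The class name, dimensions, and the use of two queries per sample are assumptions.

```python
import torch
import torch.nn as nn

class QueryOverImageFeatures(nn.Module):
    """Hypothetical sketch: string-derived queries cross-attend to image
    features -- the general pattern the SMTR summary describes, not the
    authors' implementation."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, image_feats):
        # queries:     (batch, num_queries, dim)  e.g. "next"/"previous" queries
        # image_feats: (batch, h*w, dim)          flattened visual features
        out, _ = self.attn(queries, image_feats, image_feats)
        return out

attn = QueryOverImageFeatures()
q = torch.randn(2, 2, 256)       # one "next" and one "previous" query per sample
feats = torch.randn(2, 64, 256)  # an 8x8 feature map, flattened
print(attn(q, feats).shape)      # torch.Size([2, 2, 256])
```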
- FETNet: Feature Erasing and Transferring Network for Scene Text Removal [14.763369952265796]
The scene text removal (STR) task aims to remove text regions and smoothly recover the background in images, for private information protection.
Most existing STR methods adopt encoder-decoder-based CNNs that directly copy features through the skip connections.
We propose a novel Feature Erasing and Transferring (FET) mechanism to reconfigure the encoded features for STR.
arXiv Detail & Related papers (2023-06-16T02:38:30Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder in decoding visual features into text, which leads to a loss of accuracy (a tiny sketch contrasting frozen and learnable embeddings follows this entry).
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
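As a concrete reading of "frozen initial embedding": the guidance queries are created once and excluded from gradient updates, whereas a learnable alternative can adapt during training. A tiny, hypothetical sketch (names and sizes invented; TRIG's actual initial-embedding guidance is more involved):

```python
import torch
import torch.nn as nn

dim, num_queries = 256, 25  # assumed sizes, for illustration only

# Frozen design criticized by the summary: the guidance never updates.
frozen_queries = nn.Parameter(torch.randn(num_queries, dim),
                              requires_grad=False)

# Learnable alternative: the guidance adapts with the rest of the model.
learnable_queries = nn.Parameter(torch.randn(num_queries, dim))

print(frozen_queries.requires_grad, learnable_queries.requires_grad)
# False True
```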
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic, text-matching images (a sketch of the gradient-penalty idea follows this entry).
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
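The summary names the Matching-Aware Gradient Penalty without giving its form; the sketch below shows one plausible formulation, assuming a gradient penalty taken at real images paired with their matching sentence embeddings. The discriminator interface and the weight/exponent values are assumptions, not the paper's exact loss.

```python
import torch

def matching_aware_gradient_penalty(discriminator, real_images, sent_embs,
                                    weight: float = 2.0, power: float = 6.0):
    """Hypothetical matching-aware gradient penalty: penalize the norm of
    the discriminator's gradients at (real image, matching text) pairs,
    smoothing D exactly where real, correctly matched data lives."""
    real_images = real_images.detach().requires_grad_(True)
    sent_embs = sent_embs.detach().requires_grad_(True)
    scores = discriminator(real_images, sent_embs)
    grad_img, grad_txt = torch.autograd.grad(outputs=scores.sum(),
                                             inputs=(real_images, sent_embs),
                                             create_graph=True)
    norm = (grad_img.flatten(1).norm(2, dim=1)
            + grad_txt.flatten(1).norm(2, dim=1))
    return weight * norm.pow(power).mean()
```

Taking the penalty only on matched real pairs is what makes it "matching-aware": the discriminator is smoothed precisely around correctly paired data.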
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network, mainly to solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text, and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms state-of-the-art models on the TextVQA dataset and on two tasks of the ST-VQA dataset, with the exception of the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)