CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition
- URL: http://arxiv.org/abs/2401.10041v1
- Date: Thu, 18 Jan 2024 15:05:57 GMT
- Title: CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition
- Authors: Jinzhi Zheng, Ruyi Ji, Libo Zhang, Yanjun Wu, Chen Zhao
- Abstract summary: We propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition.
CMFN consists of a position self-enhanced encoder, a visual recognition branch and an iterative semantic recognition branch.
Experiments demonstrate that the proposed CMFN algorithm achieves comparable performance to state-of-the-art algorithms.
- Score: 22.13675752628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene text recognition, as a cross-modal task involving vision and text, is
an important research topic in computer vision. Most existing methods use
language models to extract semantic information for optimizing visual
recognition. However, the guidance of visual cues is ignored in the process of
semantic mining, which limits the performance of the algorithm in recognizing
irregular scene text. To tackle this issue, we propose a novel cross-modal
fusion network (CMFN) for irregular scene text recognition, which incorporates
visual cues into the semantic mining process. Specifically, CMFN consists of a
position self-enhanced encoder, a visual recognition branch and an iterative
semantic recognition branch. The position self-enhanced encoder provides
character sequence position encoding for both the visual recognition branch and
the iterative semantic recognition branch. The visual recognition branch
recognizes text from the visual features extracted by a CNN and the position
encoding supplied by the position self-enhanced encoder. The iterative
semantic recognition branch, which consists of a language recognition module
and a cross-modal fusion gate, simulates the way humans recognize scene text
and integrates cross-modal visual cues into text recognition. Experiments
demonstrate that the proposed CMFN algorithm
achieves comparable performance to state-of-the-art algorithms, indicating its
effectiveness.
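To make the architecture concrete, below is a minimal PyTorch sketch of the
three components the abstract names. Every concrete choice here is an
assumption for illustration (the stand-in CNN, learned position queries, a
Transformer layer as the language recognition module, a sigmoid gate as the
cross-modal fusion gate, and hypothetical names such as CMFNSketch); it is
not the authors' implementation.

```python
# Illustrative sketch only: layer choices, dimensions, and the gating
# formula are assumptions, not the CMFN authors' code.
import torch
import torch.nn as nn


class CrossModalFusionGate(nn.Module):
    """Blend visual and semantic features with a learned sigmoid gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual, semantic):
        # g -> 1 trusts the visual cue; g -> 0 trusts the language cue.
        g = torch.sigmoid(self.proj(torch.cat([visual, semantic], dim=-1)))
        return g * visual + (1.0 - g) * semantic


class CMFNSketch(nn.Module):
    """Skeleton of the pipeline described in the abstract."""

    def __init__(self, dim=512, num_classes=97, max_len=26, num_iters=3):
        super().__init__()
        self.cnn = nn.Conv2d(3, dim, kernel_size=4, stride=4)  # stand-in backbone
        # "Position self-enhanced encoder", modeled here as learned queries
        # shared by the visual and semantic branches.
        self.pos_queries = nn.Parameter(torch.randn(1, max_len, dim))
        self.visual_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.language = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.embed = nn.Linear(num_classes, dim)  # embeds soft predictions
        self.fuse = CrossModalFusionGate(dim)
        self.classify = nn.Linear(dim, num_classes)
        self.num_iters = num_iters

    def forward(self, images):
        # Visual branch: attend position queries over CNN feature tokens.
        feats = self.cnn(images).flatten(2).transpose(1, 2)   # (B, HW, dim)
        queries = self.pos_queries.expand(images.size(0), -1, -1)
        visual, _ = self.visual_attn(queries, feats, feats)   # (B, T, dim)
        logits = self.classify(visual)                        # visual reading
        # Iterative semantic branch: refine with language cues, then fuse
        # them back with the visual features through the gate.
        for _ in range(self.num_iters):
            semantic = self.language(self.embed(logits.softmax(-1)))
            logits = self.classify(self.fuse(visual, semantic))
        return logits                                         # (B, T, classes)
```

As a shape check under these assumed defaults,
CMFNSketch()(torch.randn(2, 3, 32, 128)) yields logits of shape (2, 26, 97).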
Related papers
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z) - LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model [20.007650672107566]
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos.
Recent methods directly track the zero-shot results of state-of-the-art image text spotters.
Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements.
arXiv Detail & Related papers (2024-05-29T15:35:09Z) - Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z) - MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z) - Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z) - SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text
Recognition [17.191496890376197]
We propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene texts.
The proposed framework is more robust for low-quality text images, and achieves state-of-the-art results on several benchmark datasets.
arXiv Detail & Related papers (2020-05-22T03:02:46Z) - Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
A global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z) - Separating Content from Style Using Adversarial Learning for Recognizing
Text in the Wild [103.51604161298512]
We propose an adversarial learning framework for the generation and recognition of multiple characters in an image.
Our framework can be integrated into recent recognition methods to achieve new state-of-the-art recognition accuracy.
arXiv Detail & Related papers (2020-01-13T12:41:42Z)