Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text
Recognition
- URL: http://arxiv.org/abs/2107.12090v2
- Date: Tue, 27 Jul 2021 02:27:15 GMT
- Title: Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text
Recognition
- Authors: Ayan Kumar Bhunia, Aneeshan Sain, Amandeep Kumar, Shuvozit Ghose,
Pinaki Nath Chowdhury, Yi-Zhe Song
- Abstract summary: State-of-the-art (SOTA) models still struggle in the wild scenarios due to complex backgrounds, varying fonts, uncontrolled illuminations, distortions and other artefacts.
This is because such models solely depend on visual information for text recognition, thus lacking semantic reasoning capabilities.
We propose a multi-stage multi-scale attentional decoder that performs joint visual-semantic reasoning.
- Score: 36.12001394921506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although text recognition has significantly evolved over the years,
state-of-the-art (SOTA) models still struggle in the wild scenarios due to
complex backgrounds, varying fonts, uncontrolled illuminations, distortions and
other artefacts. This is because such models solely depend on visual
information for text recognition, thus lacking semantic reasoning capabilities.
In this paper, we argue that semantic information plays a complementary role
to visual information alone. More specifically, we additionally utilize semantic
information by proposing a multi-stage multi-scale attentional decoder that
performs joint visual-semantic reasoning. Our novelty lies in the intuition
that for text recognition, the prediction should be refined in a stage-wise
manner. Therefore our key contribution is in designing a stage-wise unrolling
attentional decoder where non-differentiability, invoked by discretely
predicted character labels, needs to be bypassed for end-to-end training. While
the first stage predicts using visual features, subsequent stages refine on top
of it using joint visual-semantic information. Additionally, we introduce
multi-scale 2D attention along with dense and residual connections between
different stages to deal with varying scales of character sizes, for better
performance and faster convergence during training. Experimental results show
our approach to outperform existing SOTA methods by a considerable margin.
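
The stage-wise refinement described in the abstract is easier to picture in code. Below is a minimal PyTorch sketch, not the authors' implementation: it assumes a Gumbel-softmax straight-through step as one possible way to bypass the non-differentiability of the discretely predicted character labels, uses a single-scale attention block in place of the paper's multi-scale 2D attention, and omits the dense and residual inter-stage connections. All module and parameter names (StageDecoder, MultiStageDecoder, max_len, etc.) are illustrative.

```python
# Minimal sketch of stage-wise visual-semantic decoding (not the authors' code).
# Stage 1 attends over visual features only; later stages additionally condition
# on embeddings of the previous stage's character predictions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageDecoder(nn.Module):
    """One decoding stage: attend over flattened 2D visual features, optionally
    fused with semantic context from the previous stage, then predict characters."""

    def __init__(self, feat_dim, hidden_dim, vocab_size, max_len, use_semantics):
        super().__init__()
        self.use_semantics = use_semantics
        in_dim = feat_dim + (hidden_dim if use_semantics else 0)
        self.proj = nn.Linear(in_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.query = nn.Embedding(max_len, hidden_dim)      # one query per character slot
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feats, prev_sem=None):
        # visual_feats: (B, H*W, feat_dim); prev_sem: (B, max_len, hidden_dim) or None
        B = visual_feats.size(0)
        ctx = visual_feats
        if self.use_semantics and prev_sem is not None:
            # broadcast pooled semantic context over spatial positions and fuse
            sem = prev_sem.mean(dim=1, keepdim=True).expand(-1, ctx.size(1), -1)
            ctx = torch.cat([ctx, sem], dim=-1)
        ctx = self.proj(ctx)                                  # (B, H*W, hidden_dim)
        q = self.query.weight.unsqueeze(0).expand(B, -1, -1)  # (B, max_len, hidden_dim)
        out, _ = self.attn(q, ctx, ctx)                       # attention over the feature map
        return self.classifier(out)                           # (B, max_len, vocab_size)


class MultiStageDecoder(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=256, vocab_size=97,
                 max_len=25, num_stages=3):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, hidden_dim)
        self.stages = nn.ModuleList(
            [StageDecoder(feat_dim, hidden_dim, vocab_size, max_len,
                          use_semantics=(i > 0))
             for i in range(num_stages)]
        )

    def forward(self, visual_feats):
        logits_per_stage, prev_sem = [], None
        for stage in self.stages:
            logits = stage(visual_feats, prev_sem)
            # Straight-through Gumbel-softmax (an assumed choice) keeps the discrete
            # character selection differentiable, so all stages train end-to-end.
            one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
            prev_sem = one_hot @ self.char_embed.weight       # (B, max_len, hidden_dim)
            logits_per_stage.append(logits)
        return logits_per_stage                               # supervise every stage


if __name__ == "__main__":
    feats = torch.randn(2, 8 * 32, 256)        # flattened H*W grid of CNN features
    stage_logits = MultiStageDecoder()(feats)
    print([tuple(l.shape) for l in stage_logits])  # 3 x (2, 25, 97)
```

The point of the sketch is that every stage is supervised and the soft character embeddings let gradients from later stages flow back through earlier predictions, which is what makes the stage-wise unrolling trainable end-to-end.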
Related papers
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be adapted faster to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z) - Multi-modal Text Recognition Networks: Interactive Enhancements between
Visual and Semantic Features [11.48760300147023]
This paper introduces a novel method called Multi-modAl Text Recognition Network (MATRN).
MATRN identifies visual and semantic feature pairs and encodes spatial information into semantic features.
Our experiments demonstrate that MATRN achieves state-of-the-art performance on seven benchmarks by large margins.
arXiv Detail & Related papers (2021-11-30T10:22:11Z) - A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z) - FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Towards the Unseen: Iterative Text Recognition by Distilling from Errors [41.43280922432707]
Prior art mostly struggles with recognising unseen (or rarely seen) character sequences.
We put forward a novel framework to tackle this "unseen" problem.
Key to our success is a unique cross-modal variational autoencoder.
arXiv Detail & Related papers (2021-07-26T10:06:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.