Towards the Unseen: Iterative Text Recognition by Distilling from Errors
- URL: http://arxiv.org/abs/2107.12081v1
- Date: Mon, 26 Jul 2021 10:06:42 GMT
- Title: Towards the Unseen: Iterative Text Recognition by Distilling from Errors
- Authors: Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Aneeshan Sain, Yi-Zhe Song
- Abstract summary: Prior arts mostly struggle with recognising unseen (or rarely seen) character sequences.
We put forward a novel framework to tackle this "unseen" problem.
Key to our success is a unique cross-modal variational autoencoder.
- Score: 41.43280922432707
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual text recognition is undoubtedly one of the most extensively researched
topics in computer vision. Great progress has been made to date, with the
latest models starting to focus on the more practical "in-the-wild" setting.
However, a salient problem still hinders practical deployment -- prior arts
mostly struggle with recognising unseen (or rarely seen) character sequences.
In this paper, we put forward a novel framework to specifically tackle this
"unseen" problem. Our framework is iterative in nature, in that it utilises
predicted knowledge of character sequences from a previous iteration, to
augment the main network in improving the next prediction. Key to our success
is a unique cross-modal variational autoencoder to act as a feedback module,
which is trained with the presence of textual error distribution data. This
module importantly translates the discrete predicted character space into a
continuous affine transformation parameter space, which is used to condition
the visual feature map at the next iteration. Experiments on common datasets
show competitive performance against state-of-the-art methods under the
conventional setting. Most importantly, under the new disjoint setup where
train and test labels are mutually exclusive, ours offers the best performance,
showcasing the capability of generalising to unseen words.
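To make the feedback loop concrete, below is a minimal PyTorch sketch of the iterative conditioning idea: a feedback module maps the previous iteration's character logits to per-channel affine (scale and shift) parameters that re-condition the visual feature map before the next prediction. All names, shapes, and the simple encoder used here are illustrative assumptions; the paper's actual feedback module is a cross-modal variational autoencoder trained with textual error distribution data.

```python
import torch
import torch.nn as nn

class FeedbackModule(nn.Module):
    """Hypothetical stand-in for the paper's feedback module: maps the
    previous iteration's character logits to per-channel affine
    (scale/shift) parameters for the visual feature map."""

    def __init__(self, vocab_size, max_len, channels):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Flatten(),                          # (B, max_len * vocab_size)
            nn.Linear(max_len * vocab_size, 256),
            nn.ReLU(),
        )
        self.to_gamma = nn.Linear(256, channels)   # scale
        self.to_beta = nn.Linear(256, channels)    # shift

    def forward(self, logits):                     # logits: (B, max_len, vocab_size)
        h = self.encode(logits.softmax(dim=-1))
        # Reshape to (B, C, 1, 1) so the parameters broadcast spatially.
        gamma = self.to_gamma(h)[:, :, None, None]
        beta = self.to_beta(h)[:, :, None, None]
        return gamma, beta

class TinyHead(nn.Module):
    """Toy prediction head: pools the feature map into max_len columns
    and projects each column to character logits."""

    def __init__(self, channels, max_len, vocab_size):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((1, max_len))
        self.proj = nn.Linear(channels, vocab_size)

    def forward(self, feats):                      # feats: (B, C, H, W)
        x = self.pool(feats).squeeze(2).transpose(1, 2)  # (B, max_len, C)
        return self.proj(x)                        # (B, max_len, vocab_size)

def iterative_recognition(backbone, head, feedback, image, num_iters=3):
    """Predict, then repeatedly re-condition the visual features on the
    previous prediction and predict again."""
    feats = backbone(image)                        # (B, C, H, W)
    logits = head(feats)                           # initial prediction
    for _ in range(num_iters - 1):
        gamma, beta = feedback(logits)
        logits = head(gamma * feats + beta)        # affine re-conditioning
    return logits

backbone = nn.Conv2d(3, 32, 3, padding=1)          # stand-in visual backbone
head = TinyHead(32, max_len=10, vocab_size=40)
feedback = FeedbackModule(vocab_size=40, max_len=10, channels=32)
out = iterative_recognition(backbone, head, feedback, torch.randn(2, 3, 32, 100))
print(out.shape)                                   # torch.Size([2, 10, 40])
```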
Related papers
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
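As a rough sketch of the cross-modal late interaction mechanism mentioned above: each image token takes its maximum similarity over text tokens, and these maxima are averaged (symmetrically for text-to-image). The toy shapes and inputs below are assumptions; FILIP uses such scores inside a batch-level contrastive objective.

```python
import torch
import torch.nn.functional as F

def late_interaction_similarity(img_tokens, txt_tokens):
    """Token-wise late interaction in the spirit of FILIP: each image
    token takes its max similarity over text tokens, then the maxima
    are averaged (and symmetrically for text-to-image).

    img_tokens: (B, Ni, D); txt_tokens: (B, Nt, D); both L2-normalized.
    """
    sim = torch.einsum('bid,btd->bit', img_tokens, txt_tokens)  # (B, Ni, Nt)
    i2t = sim.max(dim=2).values.mean(dim=1)  # image-to-text score, (B,)
    t2i = sim.max(dim=1).values.mean(dim=1)  # text-to-image score, (B,)
    return i2t, t2i

# Toy usage with random, normalized token embeddings.
img = F.normalize(torch.randn(4, 49, 64), dim=-1)   # e.g. 7x7 patch tokens
txt = F.normalize(torch.randn(4, 12, 64), dim=-1)   # 12 word tokens
i2t, t2i = late_interaction_similarity(img, txt)
print(i2t.shape, t2i.shape)                          # torch.Size([4]) twice
```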
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels to learn a fixed set of weights, seen as visual concepts, to aligning images and raw text with two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
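A minimal sketch of the zero-shot transfer this two-encoder paradigm enables: embed one text prompt per class, then classify an image by cosine similarity between its embedding and the prompt embeddings. The random features below are stand-ins for the actual image and text encoders.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_features, class_text_features):
    """Score each image against the text embedding of every class
    prompt and return the index of the best-matching class.

    image_features: (B, D); class_text_features: (K, D); L2-normalized.
    """
    logits = image_features @ class_text_features.t()   # (B, K) cosine scores
    return logits.argmax(dim=-1)

# Stand-in random features; a real system would encode an image and K
# prompts such as "a photo of a {class}" with the two encoders.
img_feat = F.normalize(torch.randn(8, 512), dim=-1)
txt_feat = F.normalize(torch.randn(10, 512), dim=-1)    # 10 class prompts
print(zero_shot_classify(img_feat, txt_feat))           # 8 class indices
```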
arXiv Detail & Related papers (2021-09-02T17:57:31Z)
- Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition [36.12001394921506]
State-of-the-art (SOTA) models still struggle in in-the-wild scenarios due to complex backgrounds, varying fonts, uncontrolled illumination, distortions and other artefacts.
This is because such models solely depend on visual information for text recognition, thus lacking semantic reasoning capabilities.
We propose a multi-stage multi-scale attentional decoder that performs joint visual-semantic reasoning.
arXiv Detail & Related papers (2021-07-26T10:15:14Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
- Incomplete Utterance Rewriting as Semantic Segmentation [57.13577518412252]
We present a novel and extensive approach that formulates incomplete utterance rewriting as a semantic segmentation task.
Instead of generating from scratch, this formulation introduces edit operations and casts the problem as the prediction of a word-level edit matrix.
Our approach is four times faster than the standard approach at inference.
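To illustrate the word-level edit matrix as a data structure, the sketch below applies a matrix of edit operations that relates context words to utterance positions. The label set and the application rule are simplified assumptions, not the paper's exact formulation.

```python
# Edit operations relating a context word to an utterance position.
NONE, SUBSTITUTE, INSERT_BEFORE = 0, 1, 2

def apply_edit_matrix(context, utterance, edits):
    """Apply a word-level edit matrix: edits[i][j] says how context
    word i acts on utterance word j."""
    out = []
    for j, word in enumerate(utterance):
        # Context words inserted in front of position j.
        out += [context[i] for i in range(len(context))
                if edits[i][j] == INSERT_BEFORE]
        # A substitution replaces the utterance word with context words.
        subs = [context[i] for i in range(len(context))
                if edits[i][j] == SUBSTITUTE]
        out += subs if subs else [word]
    return out

context = ["I", "like", "Beijing"]
utterance = ["Why", "do", "you", "like", "it"]
edits = [[NONE] * len(utterance) for _ in context]
edits[2][4] = SUBSTITUTE                 # resolve "it" -> "Beijing"
print(apply_edit_matrix(context, utterance, edits))
# ['Why', 'do', 'you', 'like', 'Beijing']
```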
arXiv Detail & Related papers (2020-09-28T09:29:49Z)
- Pay Attention to What You Read: Non-recurrent Handwritten Text-Line Recognition [4.301658883577544]
We introduce a non-recurrent approach to recognising handwritten text using transformer models.
The approach tackles character recognition while also learning language-related dependencies of the character sequences to be decoded.
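A minimal sketch of such a non-recurrent recognizer, assuming column-wise visual features from some backbone: a standard transformer decoder cross-attends over the visual sequence while masked self-attention over previously decoded characters captures language-related dependencies. Sizes and the feature extractor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerLineRecognizer(nn.Module):
    """Toy non-recurrent recognizer: a transformer decoder attends over
    visual features of a text line; masked self-attention over previous
    characters captures language-like dependencies."""

    def __init__(self, vocab_size=80, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_seq, prev_chars):
        # visual_seq: (B, S, d_model) column features from some backbone
        # prev_chars: (B, T) shifted target characters (teacher forcing)
        tgt = self.char_emb(prev_chars)
        T = tgt.size(1)
        # Causal mask: each position may only attend to earlier characters.
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        dec = self.decoder(tgt, visual_seq, tgt_mask=mask)
        return self.out(dec)                       # (B, T, vocab_size)

model = TransformerLineRecognizer()
feats = torch.randn(2, 60, 128)                    # 60 feature columns
chars = torch.randint(0, 80, (2, 15))
print(model(feats, chars).shape)                   # torch.Size([2, 15, 80])
```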
arXiv Detail & Related papers (2020-05-26T21:15:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.