Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
- URL: http://arxiv.org/abs/2103.06495v1
- Date: Thu, 11 Mar 2021 06:47:45 GMT
- Title: Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
- Authors: Shancheng Fang, Hongtao Xie, Yuxin Wang, Zhendong Mao, Yongdong Zhang
- Abstract summary: Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose ABINet, an autonomous, bidirectional and iterative network for scene text recognition.
- Score: 80.446770909975
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Linguistic knowledge is of great benefit to scene text recognition. However,
how to effectively model linguistic rules in end-to-end deep networks remains a
research challenge. In this paper, we argue that the limited capacity of
language models comes from: 1) implicit language modeling; 2) unidirectional
feature representation; and 3) a language model operating on noisy input.
Correspondingly, we propose ABINet, an autonomous, bidirectional and iterative
network for scene text recognition. Firstly, autonomous means blocking gradient
flow between the vision and language models to enforce explicit language
modeling. Secondly, we propose a novel bidirectional cloze network (BCN), built
on bidirectional feature representation, as the language model. Thirdly, we
propose an iterative-correction execution scheme for the language model, which
effectively alleviates the impact of noisy input. Additionally, based on the
ensemble of iterative predictions, we propose a self-training method that
learns effectively from unlabeled images. Extensive experiments show that
ABINet is superior on low-quality images and achieves state-of-the-art
results on several mainstream benchmarks. Furthermore, ABINet trained with
ensemble self-training shows promising progress toward human-level
recognition. Code is available at https://github.com/FangShancheng/ABINet.
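
To make the three design principles concrete, here is a minimal PyTorch sketch; the authors' actual implementation is in the repository linked above. All names here (BidirectionalClozeLM, recognize, the additive fusion rule, T, C) are illustrative assumptions for this summary, not the paper's API.

```python
import torch
import torch.nn as nn

T, C = 25, 37  # hypothetical max text length and charset size


class BidirectionalClozeLM(nn.Module):
    """Cloze-style LM sketch: position i attends to all positions except
    itself, so each character is predicted from full left AND right context
    simultaneously (unlike a unidirectional LM)."""

    def __init__(self, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Linear(C, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.cls = nn.Linear(d_model, C)
        # -inf on the diagonal blocks each position's attention to itself.
        # Caveat: residual connections still carry the position's own
        # embedding; the paper's BCN avoids this with positional queries
        # in cross-attention, which we simplify away here.
        mask = torch.full((T, T), float("-inf"))
        mask = mask.masked_fill(~torch.eye(T, dtype=torch.bool), 0.0)
        self.register_buffer("cloze_mask", mask)

    def forward(self, char_probs):                 # char_probs: (B, T, C)
        x = self.embed(char_probs)
        x = self.encoder(x, mask=self.cloze_mask)
        return self.cls(x)                         # refined logits (B, T, C)


def recognize(vision_model, language_model, image, num_iters=3):
    """Autonomous + iterative execution of the two models."""
    vis_logits = vision_model(image)               # (B, T, C)
    fused = vis_logits
    for _ in range(num_iters):
        # detach() blocks gradients from flowing back into the vision model,
        # enforcing explicit (autonomous) language modeling.
        lm_logits = language_model(fused.softmax(-1).detach())
        # Simple additive fusion; the paper uses a learned gated fusion.
        fused = vis_logits + lm_logits             # iterative correction
    return fused.argmax(-1)                        # predicted character ids
```

Because the language model only ever sees detached character probabilities, it can also be pre-trained on text alone and still plugged into the same loop.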
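The ensemble self-training step could then look roughly like the sketch below. The confidence threshold, the averaging over iterations, and the per-sample gating rule are assumptions for illustration, not the paper's exact recipe.

```python
import torch


def pseudo_label(vision_model, language_model, unlabeled_loader,
                 threshold=0.9, num_iters=3):
    """Sketch: pseudo-label unlabeled images from the ensemble of iterative
    predictions, keeping only samples whose least confident character clears
    the threshold (gating rule assumed for illustration)."""
    selected = []
    with torch.no_grad():
        for images in unlabeled_loader:
            vis_logits = vision_model(images)
            fused, per_iter = vis_logits, []
            for _ in range(num_iters):
                lm_logits = language_model(fused.softmax(-1).detach())
                fused = vis_logits + lm_logits
                per_iter.append(fused.softmax(-1))
            probs = torch.stack(per_iter).mean(0)  # ensemble over iterations
            conf, chars = probs.max(-1)            # (B, T) confidence, labels
            keep = conf.min(dim=1).values >= threshold
            selected += [(img, seq) for img, seq, ok
                         in zip(images, chars, keep) if ok]
    return selected  # mix with labeled data and retrain
```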
Related papers
- Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model, we can assess the "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation into the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed RAVEn (Raw Audio-Visual Speech Encoders) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multilingual models with more data outperform monolingual ones, but, with the amount of data held fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the resulting encodings outperform comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model operating on noisy input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Are discrete units necessary for Spoken Language Modeling? [10.374092717909603]
Recent work in spoken language modeling shows the possibility of learning a language model from raw audio in an unsupervised manner, without any text labels.
We show that discretization is indeed essential for good results in spoken language modeling.
We also show that an end-to-end model trained with a discrete target, as in HuBERT, achieves results similar to the best language model trained on pseudo-text.
arXiv Detail & Related papers (2022-03-11T14:14:35Z)
- TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect [0.0]
We investigate the feasibility of training monolingual Transformer-based language models for under-represented languages.
We show that using noisy web-crawled data instead of structured data is more suitable for such a non-standardized language.
Our best performing TunBERT model reaches or improves the state-of-the-art in all three downstream tasks.
arXiv Detail & Related papers (2021-11-25T15:49:50Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- Learning Spoken Language Representations with Neural Lattice Language Modeling [39.50831917042577]
We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks.
The proposed two-stage pre-training approach reduces the demand for speech data and improves efficiency.
arXiv Detail & Related papers (2020-07-06T10:38:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.