Scene Text Recognition with Permuted Autoregressive Sequence Models
- URL: http://arxiv.org/abs/2207.06966v1
- Date: Thu, 14 Jul 2022 14:51:50 GMT
- Title: Scene Text Recognition with Permuted Autoregressive Sequence Models
- Authors: Darwin Bautista, Rowel Atienza
- Abstract summary: Context-aware STR methods typically use internal autoregressive (AR) language models (LM).
Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using Permutation Language Modeling.
It unifies context-free non-AR and context-aware AR inference, and iterative refinement using bidirectional context.
- Score: 15.118059441365343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context-aware STR methods typically use internal autoregressive (AR) language
models (LM). Inherent limitations of AR models motivated two-stage methods
which employ an external LM. The conditional independence of the external LM on
the input image may cause it to erroneously rectify correct predictions,
leading to significant inefficiencies. Our method, PARSeq, learns an ensemble
of internal AR LMs with shared weights using Permutation Language Modeling. It
unifies context-free non-AR and context-aware AR inference, and iterative
refinement using bidirectional context. Using synthetic training data, PARSeq
achieves state-of-the-art (SOTA) results in STR benchmarks (91.9% accuracy) and
more challenging datasets. It establishes new SOTA results (96.0% accuracy)
when trained on real data. PARSeq is optimal on accuracy vs parameter count,
FLOPS, and latency because of its simple, unified structure and parallel token
processing. Due to its extensive use of attention, it is robust on
arbitrarily-oriented text which is common in real-world images. Code,
pretrained weights, and data are available at: https://github.com/baudm/parseq.
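To make the permutation language modeling (PLM) idea concrete, below is a minimal PyTorch-style sketch of how different decoding orders reduce to attention masks over the target positions. It is an illustration only, not the official implementation in the linked repository; the function name plm_attention_mask and the mask convention (True = may attend) are assumptions made for this sketch.

```python
import torch

def plm_attention_mask(perm: torch.Tensor) -> torch.Tensor:
    """Build a T x T boolean mask from a permutation of T target positions.

    Entry (i, j) is True if the token at position i may attend to the token
    at position j. Under PLM, the token at perm[k] conditions only on the
    tokens at perm[0..k-1], whatever their left-to-right order in the string.
    """
    T = perm.numel()
    mask = torch.zeros(T, T, dtype=torch.bool)
    for k in range(T):
        mask[perm[k], perm[:k]] = True  # attend to earlier steps of this ordering
    return mask

T = 5
ar_mask = plm_attention_mask(torch.arange(T))          # identity perm -> left-to-right AR (causal) mask
rl_mask = plm_attention_mask(torch.arange(T).flip(0))  # reversed perm -> right-to-left AR mask
rand_mask = plm_attention_mask(torch.randperm(T))      # one member of the weight-shared AR ensemble
cloze_mask = ~torch.eye(T, dtype=torch.bool)           # bidirectional cloze mask for iterative refinement
```

With shared decoder weights, training against many such permutations yields an ensemble of internal AR LMs; the identity and reversed permutations recover left-to-right and right-to-left AR decoding, while the cloze mask enables iterative refinement with bidirectional context.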
Related papers
- Context-aware Prompt Tuning: Advancing In-Context Learning with Adversarial Methods [69.36397993451742]
This work introduces Context-aware Prompt Tuning (CPT), a method inspired by ICL, PT, and adversarial attacks.
We modify specific context tokens, considering the unique structure of input and output formats.
Inspired by adversarial attacks, we adjust the input based on the labels present in the context, focusing on minimizing, rather than maximizing, the loss.
arXiv Detail & Related papers (2024-10-22T17:45:47Z)
- LLM-based speaker diarization correction: A generalizable approach [0.0]
We investigate the use of large language models (LLMs) for diarization correction as a post-processing step.
We measured the models' ability to improve diarization accuracy on a holdout dataset from the Fisher corpus as well as on an independent dataset.
arXiv Detail & Related papers (2024-06-07T13:33:22Z)
- LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the SotA when the target text is large.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- Mixture of Soft Prompts for Controllable Data Generation [21.84489422361048]
Mixture of Soft Prompts (MSP) is proposed as a tool for data augmentation rather than direct prediction.
Our method achieves state-of-the-art results on three benchmarks when compared against strong baselines.
arXiv Detail & Related papers (2023-03-02T21:13:56Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- Learning to Ask Conversational Questions by Optimizing Levenshtein Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions.
RISE is able to pay attention to tokens that are related to conversational characteristics.
Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-06-30T08:44:19Z)
- Injecting Knowledge in Data-driven Vehicle Trajectory Predictors [82.91398970736391]
Vehicle trajectory prediction tasks have been commonly tackled from two perspectives: knowledge-driven or data-driven.
In this paper, we propose to learn a "Realistic Residual Block" (RRB) which effectively connects these two perspectives.
Our proposed method outputs realistic predictions by confining the residual range and taking into account its uncertainty.
arXiv Detail & Related papers (2021-03-08T16:03:09Z)
- Improving AMR Parsing with Sequence-to-Sequence Pre-training [39.33133978535497]
In this paper, we focus on sequence-to-sequence (seq2seq) AMR parsing.
We propose a seq2seq pre-training approach to build pre-trained models in both single and joint ways.
Experiments show that both the single and joint pre-trained models significantly improve the performance.
arXiv Detail & Related papers (2020-10-05T04:32:47Z)
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [133.93803565077337]
Retrieval-augmented generation (RAG) models combine pre-trained parametric and non-parametric memory for language generation.
We show that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
arXiv Detail & Related papers (2020-05-22T21:34:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.