Using Human Psychophysics to Evaluate Generalization in Scene Text
Recognition Models
- URL: http://arxiv.org/abs/2007.00083v1
- Date: Tue, 30 Jun 2020 19:51:26 GMT
- Authors: Sahar Siddiqui, Elena Sizikova, Gemma Roig, Najib J. Majaj, Denis G.
Pelli
- Abstract summary: We characterize two important scene text recognition models by measuring their domains.
The domain specifies the ability of readers to generalize to different word lengths, fonts, and amounts of occlusion.
- Score: 7.294729862905325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition models have advanced greatly in recent years. Inspired
by human reading, we characterize two important scene text recognition models by
measuring their domains, i.e., the range of stimulus images that they can read.
The domain specifies the ability of readers to generalize to different word
lengths, fonts, and amounts of occlusion. These metrics identify strengths and
weaknesses of existing models. Relative to the attention-based (Attn) model, we
discover that the connectionist temporal classification (CTC) model is more
robust to noise and occlusion, and better at generalizing to different word
lengths. Further, we show that in both models, adding noise to training images
yields better generalization to occlusion. These results demonstrate the value
of testing models till they break, complementing the traditional data science
focus on optimizing performance.
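The abstract's core method, measuring a model's domain by testing it until it breaks, can be sketched in spirit as follows. This is an illustrative toy, not the authors' setup: `toy_recognizer`, the word list, and the occlusion-by-'#' scheme are assumptions standing in for the actual CTC/Attn models and rendered stimulus images.

```python
import random

def occlude(word: str, fraction: float, rng: random.Random) -> str:
    """Replace a given fraction of characters with '#' to simulate occlusion."""
    chars = list(word)
    n = min(len(chars), max(0, round(fraction * len(chars))))
    for i in rng.sample(range(len(chars)), n):
        chars[i] = "#"
    return "".join(chars)

def measure_domain(recognize, words, fractions, rng):
    """Accuracy at each occlusion level. The 'domain' is the range of
    stimulus parameters over which accuracy stays high -- analogous to
    a psychophysical threshold measurement."""
    curve = {}
    for f in fractions:
        correct = sum(recognize(occlude(w, f, rng)) == w for w in words)
        curve[f] = correct / len(words)
    return curve

# Stand-in recognizer: "reads" a word correctly only when nothing is occluded.
toy_recognizer = lambda stimulus: stimulus if "#" not in stimulus else ""

rng = random.Random(0)
words = ["street", "exit", "open", "market"]
curve = measure_domain(toy_recognizer, words, [0.0, 0.25, 0.5], rng)
```

Sweeping other stimulus dimensions (word length, font, noise) the same way yields the full domain characterization the paper describes, one accuracy curve per dimension.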
Related papers
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z) - ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling [35.098725056881655]
Recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.
The generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements.
We introduce a novel framework, ViGoR, that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines.
arXiv Detail & Related papers (2024-02-09T01:00:14Z) - Visual Analytics for Efficient Image Exploration and User-Guided Image
Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z) - Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet
Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
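A hypernymy-based metric of this kind could be sketched as below. The paper uses the real WordNet hierarchy; the tiny `hyponyms` dictionary here is a hypothetical stand-in, and `in_subtree_precision` is an illustrative simplification, not the paper's actual metric.

```python
# Toy "is-a" hierarchy: a minimal stand-in for WordNet.
hyponyms = {
    "animal": {"dog", "cat", "sparrow"},
    "bird": {"sparrow"},
}

def in_subtree_precision(concept, generated_labels, hierarchy):
    """Fraction of generated images whose classified label falls under the
    prompted concept -- a simplified hypernymy score. A model that, when
    prompted with "animal", only ever draws things that are animals scores 1.0."""
    valid = hierarchy.get(concept, set()) | {concept}
    hits = sum(label in valid for label in generated_labels)
    return hits / len(generated_labels)

# E.g. four images generated for the prompt "animal", then classified:
score = in_subtree_precision("animal", ["dog", "car", "sparrow", "cat"], hyponyms)
```

Comparing such scores across concepts exposes which parts of the "is-a" hierarchy a text-to-image model handles well and where it fails.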
arXiv Detail & Related papers (2023-10-13T16:53:25Z) - Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction [61.16125290912494]
EVLGen is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - Few-shot Domain-Adaptive Visually-fused Event Detection from Text [13.189886554546929]
We present a novel domain-adaptive visually-fused event detection approach that can be trained on a few labelled image-text paired data points.
Specifically, we introduce a visual imaginator method that synthesises images from text in the absence of visual context.
Our model can leverage the capabilities of pre-trained vision-language models and can be trained in a few-shot setting.
arXiv Detail & Related papers (2023-05-04T00:10:57Z) - Learnable Visual Words for Interpretable Image Recognition [70.85686267987744]
We propose the Learnable Visual Words (LVW) to interpret the model prediction behaviors with two novel modules.
The semantic visual words learning relaxes the category-specific constraint, enabling the general visual words shared across different categories.
Our experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW in both accuracy and model interpretation.
arXiv Detail & Related papers (2022-05-22T03:24:45Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z) - Temporal Embeddings and Transformer Models for Narrative Text
Understanding [72.88083067388155]
We present two approaches to narrative text understanding for character relationship modelling.
The temporal evolution of these relations is described by dynamic word embeddings, that are designed to learn semantic changes over time.
A supervised learning approach based on the state-of-the-art transformer model BERT is used instead to detect static relations between characters.
arXiv Detail & Related papers (2020-03-19T14:23:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.