Using Human Psychophysics to Evaluate Generalization in Scene Text
Recognition Models
- URL: http://arxiv.org/abs/2007.00083v1
- Date: Tue, 30 Jun 2020 19:51:26 GMT
- Authors: Sahar Siddiqui, Elena Sizikova, Gemma Roig, Najib J. Majaj, Denis G.
Pelli
- Abstract summary: We characterize two important scene text recognition models by measuring their domains.
The domain specifies the ability of readers to generalize to different word lengths, fonts, and amounts of occlusion.
- Score: 7.294729862905325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition models have advanced greatly in recent years. Inspired
by human reading, we characterize two important scene text recognition models by
measuring their domains, i.e., the range of stimulus images that they can read.
The domain specifies the ability of readers to generalize to different word
lengths, fonts, and amounts of occlusion. These metrics identify strengths and
weaknesses of existing models. Relative to the attention-based (Attn) model, we
discover that the connectionist temporal classification (CTC) model is more
robust to noise and occlusion, and better at generalizing to different word
lengths. Further, we show that in both models, adding noise to training images
yields better generalization to occlusion. These results demonstrate the value
of testing models till they break, complementing the traditional data science
focus on optimizing performance.
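The abstract's core method, measuring a model's domain by testing it until it breaks, can be sketched in spirit as follows. This is an illustrative toy, not the authors' setup: `toy_recognizer`, the word list, and the occlusion-by-'#' scheme are assumptions standing in for the actual CTC/Attn models and rendered stimulus images.

```python
import random

def occlude(word: str, fraction: float, rng: random.Random) -> str:
    """Replace a given fraction of characters with '#' to simulate occlusion."""
    chars = list(word)
    n = min(len(chars), max(0, round(fraction * len(chars))))
    for i in rng.sample(range(len(chars)), n):
        chars[i] = "#"
    return "".join(chars)

def measure_domain(recognize, words, fractions, rng):
    """Accuracy at each occlusion level. The 'domain' is the range of
    stimulus parameters over which accuracy stays high -- analogous to
    a psychophysical threshold measurement."""
    curve = {}
    for f in fractions:
        correct = sum(recognize(occlude(w, f, rng)) == w for w in words)
        curve[f] = correct / len(words)
    return curve

# Stand-in recognizer: "reads" a word correctly only when nothing is occluded.
toy_recognizer = lambda stimulus: stimulus if "#" not in stimulus else ""

rng = random.Random(0)
words = ["street", "exit", "open", "market"]
curve = measure_domain(toy_recognizer, words, [0.0, 0.25, 0.5], rng)
```

Sweeping other stimulus dimensions (word length, font, noise) the same way yields the full domain characterization the paper describes, one accuracy curve per dimension.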
Related papers
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z) - ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling [35.098725056881655]
Recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.
The generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements.
We introduce a novel framework, ViGoR, that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines.
arXiv Detail & Related papers (2024-02-09T01:00:14Z) - Visual Analytics for Efficient Image Exploration and User-Guided Image
Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z) - Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet
Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
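A hypernymy-based metric of this kind could be sketched as below. The paper uses the real WordNet hierarchy; the tiny `hyponyms` dictionary here is a hypothetical stand-in, and `in_subtree_precision` is an illustrative simplification, not the paper's actual metric.

```python
# Toy "is-a" hierarchy: a minimal stand-in for WordNet.
hyponyms = {
    "animal": {"dog", "cat", "sparrow"},
    "bird": {"sparrow"},
}

def in_subtree_precision(concept, generated_labels, hierarchy):
    """Fraction of generated images whose classified label falls under the
    prompted concept -- a simplified hypernymy score. A model that, when
    prompted with "animal", only ever draws things that are animals scores 1.0."""
    valid = hierarchy.get(concept, set()) | {concept}
    hits = sum(label in valid for label in generated_labels)
    return hits / len(generated_labels)

# E.g. four images generated for the prompt "animal", then classified:
score = in_subtree_precision("animal", ["dog", "car", "sparrow", "cat"], hyponyms)
```

Comparing such scores across concepts exposes which parts of the "is-a" hierarchy a text-to-image model handles well and where it fails.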
arXiv Detail & Related papers (2023-10-13T16:53:25Z) - Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction [61.16125290912494]
EVLGen is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - Few-shot Domain-Adaptive Visually-fused Event Detection from Text [13.189886554546929]
We present a novel domain-adaptive visually-fused event detection approach that can be trained on a few labelled image-text paired data points.
Specifically, we introduce a visual imaginator method that synthesises images from text in the absence of visual context.
Our model can leverage the capabilities of pre-trained vision-language models and can be trained in a few-shot setting.
arXiv Detail & Related papers (2023-05-04T00:10:57Z) - Learnable Visual Words for Interpretable Image Recognition [70.85686267987744]
We propose the Learnable Visual Words (LVW) to interpret the model prediction behaviors with two novel modules.
The semantic visual words learning relaxes the category-specific constraint, enabling the general visual words shared across different categories.
Our experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW in both accuracy and model interpretation.
arXiv Detail & Related papers (2022-05-22T03:24:45Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z) - Temporal Embeddings and Transformer Models for Narrative Text
Understanding [72.88083067388155]
We present two approaches to narrative text understanding for character relationship modelling.
The temporal evolution of these relations is described by dynamic word embeddings, that are designed to learn semantic changes over time.
A supervised learning approach based on the state-of-the-art transformer model BERT is used instead to detect static relations between characters.
arXiv Detail & Related papers (2020-03-19T14:23:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.