TextMatcher: Cross-Attentional Neural Network to Compare Image and Text
- URL: http://arxiv.org/abs/2205.05507v1
- Date: Wed, 11 May 2022 14:01:12 GMT
- Title: TextMatcher: Cross-Attentional Neural Network to Compare Image and Text
- Authors: Valentina Arrigoni, Luisa Repele, Dario Marino Saccavino
- Abstract summary: We devise the first machine-learning model specifically designed for this problem.
We extensively evaluate the empirical performance of TextMatcher on the popular IAM dataset.
We showcase TextMatcher in a real-world application scenario concerning the automatic processing of bank cheques.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We study a novel multimodal-learning problem, which we call text matching:
given an image containing a single-line text and a candidate text
transcription, the goal is to assess whether the text represented in the image
corresponds to the candidate text. We devise the first machine-learning model
specifically designed for this problem. The proposed model, termed TextMatcher,
compares the two inputs by applying a cross-attention mechanism over the
embedding representations of image and text, and it is trained in an end-to-end
fashion. We extensively evaluate the empirical performance of TextMatcher on
the popular IAM dataset. Results attest that, compared to a baseline and
existing models designed for related problems, TextMatcher achieves higher
performance on a variety of configurations, while at the same time running
faster at inference time. We also showcase TextMatcher in a real-world
application scenario concerning the automatic processing of bank cheques.
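As a rough illustration of the architecture described above, the following is a minimal PyTorch sketch of a cross-attentional image-text matcher. It is not the authors' exact model: the convolutional backbone, embedding sizes, pooling, and classifier head are illustrative assumptions; only the overall pattern (candidate-text embeddings attending over image features, trained end-to-end on a match/no-match signal) follows the abstract.

```python
import torch
import torch.nn as nn

class CrossAttentionMatcher(nn.Module):
    """Toy cross-attentional matcher for (line image, candidate text) pairs."""

    def __init__(self, vocab_size=128, d_model=256, n_heads=4):
        super().__init__()
        # Image branch: a small CNN turning a single-line text image
        # (assumed grayscale, height 32) into a sequence of column features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width
        )
        # Text branch: character embeddings for the candidate transcription.
        self.char_emb = nn.Embedding(vocab_size, d_model)
        # Cross-attention: candidate characters attend over image columns.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 1)  # match / no-match logit

    def forward(self, image, char_ids):
        feats = self.cnn(image)                     # (B, d, 1, W')
        img_seq = feats.squeeze(2).transpose(1, 2)  # (B, W', d)
        txt_seq = self.char_emb(char_ids)           # (B, L, d)
        # Each character queries the image regions it should align with.
        attended, _ = self.cross_attn(txt_seq, img_seq, img_seq)
        return self.classifier(attended.mean(dim=1)).squeeze(-1)

# Usage: score a batch of (line image, candidate transcription) pairs.
model = CrossAttentionMatcher()
images = torch.randn(2, 1, 32, 128)      # two grayscale line images
chars = torch.randint(0, 128, (2, 16))   # two candidate texts, 16 chars each
logits = model(images, chars)            # higher logit = more likely a match
```

In an end-to-end training setup of this kind, the logit would typically be optimized with a binary cross-entropy loss against match/no-match labels.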
Related papers
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds.
We contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images containing text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z)
- What You See is What You Read? Improving Text-Image Alignment Evaluation [28.722369586165108]
We study methods for automatic text-image alignment evaluation.
We first introduce SeeTRUE, an evaluation suite spanning multiple datasets from both text-to-image and image-to-text generation tasks.
We describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models.
arXiv Detail & Related papers (2023-05-17T17:43:38Z)
- Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models [52.29800567587504]
We propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), to select optimal tokens via localized supervision with text information.
TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts.
We validate the efficacy of TCTS combined with Frequency Adaptive Sampling (FAS) on various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality.
arXiv Detail & Related papers (2023-04-04T03:52:49Z)
- WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models [8.334487584550185]
We present a latent diffusion-based method for word-level generation of handwritten text images with a given style and text content.
Our proposed method is able to generate realistic word image samples from different writer styles.
We show that the proposed model produces samples that are aesthetically pleasing, help boost text recognition performance, and achieve writer retrieval scores similar to real data.
arXiv Detail & Related papers (2023-03-29T10:19:26Z)
- JOIST: A Joint Speech and Text Streaming Model For ASR [63.15848310748753]
We present JOIST, an algorithm to train a streaming, cascaded-encoder end-to-end (E2E) model on both paired speech-text inputs and unpaired text-only inputs.
We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text.
arXiv Detail & Related papers (2022-10-13T20:59:22Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and retrieve, from an image gallery, all text instances that are the same as or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be performed simply by ranking the detected text instances with the learned similarity (sketched below).
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
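The entry above ranks detected text instances by a learned cross-modal similarity; a minimal sketch of that ranking step follows. The cosine similarity and the stand-in embeddings are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rank_instances(query_emb: torch.Tensor, instance_embs: torch.Tensor):
    """Sort detected text instances from most to least similar to the query.

    query_emb:     (d,) embedding of the query text
    instance_embs: (N, d) embeddings of N detected text instances
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), instance_embs, dim=-1)
    return torch.argsort(sims, descending=True), sims

# Usage with random stand-in embeddings; in the real system these would
# come from the jointly trained text and image branches.
query = torch.randn(256)
instances = torch.randn(10, 256)
order, scores = rank_instances(query, instances)
```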
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss (sketched below).
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
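The last entry trains its embeddings with a hinge-based triplet ranking objective; a minimal sketch of such a loss follows. The margin value and cosine similarity are assumptions, not that paper's exact settings.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: keep matching pairs above mismatched ones by a margin."""
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)  # matching pairs
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)  # mismatched pairs
    # Penalize whenever a negative is not at least `margin` below its positive.
    return F.relu(margin + sim_neg - sim_pos).mean()

# Usage: image embeddings as anchors, matching/mismatched text embeddings.
img = torch.randn(8, 256)
txt_match = torch.randn(8, 256)
txt_mismatch = torch.randn(8, 256)
loss = triplet_ranking_loss(img, txt_match, txt_mismatch)
```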