Geometric Perception based Efficient Text Recognition
- URL: http://arxiv.org/abs/2302.03873v1
- Date: Wed, 8 Feb 2023 04:19:24 GMT
- Title: Geometric Perception based Efficient Text Recognition
- Authors: P. N. Deelaka, D. R. Jayakodi, D. Y. Silva
- Abstract summary: In real-world applications with fixed camera positions, the underlying data tends to be regular scene text.
This paper introduces the underlying concepts, theory, implementation, and experimental results for developing specialized models.
We introduce a novel deep learning architecture (GeoTRNet), trained to identify digits in a regular scene image using only the geometric features present.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Every Scene Text Recognition (STR) task consists of text
localization and text recognition as its prominent sub-tasks. However, in
real-world applications with fixed camera positions, such as equipment monitor
reading, image-based data entry, and printed document data extraction, the
underlying data tends to be regular scene text. In these tasks, generic, bulky
models come with significant disadvantages compared to customized, efficient
models in terms of model deployability, data privacy, and model reliability.
Therefore, this paper introduces the underlying concepts, theory,
implementation, and experimental results for developing models that are highly
specialized for the task itself, achieving not only SOTA performance but also
minimal model weights, short inference time, and high model reliability. We
introduce a novel deep learning architecture (GeoTRNet), trained to identify
digits in a regular scene image using only the geometric features present,
mimicking human perception of text. The code is publicly available at
https://github.com/ACRA-FL/GeoTRNet
Related papers
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and insufficient annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Chart-RCNN: Efficient Line Chart Data Extraction from Camera Images [0.0]
Line Chart Data Extraction is a natural extension of Optical Character Recognition.
We propose a synthetic data generation framework and a one-stage model that outputs text labels, mark coordinates, and perspective estimation simultaneously.
Results show that our model trained only on synthetic data can be applied to real photos without any fine-tuning and is feasible for real-world application.
arXiv Detail & Related papers (2022-11-25T19:55:52Z)
- InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images [4.544151613454639]
We argue that depending on the application, users of image retrieval systems have different and changing similarity notions.
We present Language-Guided Zero-Shot Deep Metric Learning (LanZ-DML) as a new DML setting.
InDiReCT is a model for LanZ-DML on images that exclusively uses a few text prompts for training.
arXiv Detail & Related papers (2022-11-23T08:09:50Z)
- The Surprisingly Straightforward Scene Text Removal Method With Gated Attention and Region of Interest Generation: A Comprehensive Prominent Model Analysis [0.76146285961466]
Scene text removal (STR) is the task of erasing text from natural scene images.
We introduce a simple yet extremely effective Gated Attention (GA) and Region-of-Interest Generation (RoIG) methodology in this paper.
Experimental results on the benchmark dataset show that our method significantly outperforms existing state-of-the-art methods in almost all metrics.
arXiv Detail & Related papers (2022-10-14T03:34:21Z)
- ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller amount of image-text pairs.
Our model has unique properties; most notably, a new version with updated training samples can be deployed in a matter of seconds (a toy sketch of the underlying idea appears after this list).
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
- TLGAN: document Text Localization using Generative Adversarial Nets [2.1378501793514277]
Text localization in digital images is the first step of optical character recognition.
Deep neural networks are used to perform text localization in digital images.
Trained on only ten labeled receipt images from the Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE),
TLGAN achieved 99.83% precision and 99.64% recall on the SROIE test data.
arXiv Detail & Related papers (2020-10-22T09:19:13Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)