UPOCR: Towards Unified Pixel-Level OCR Interface
- URL: http://arxiv.org/abs/2312.02694v1
- Date: Tue, 5 Dec 2023 11:53:17 GMT
- Title: UPOCR: Towards Unified Pixel-Level OCR Interface
- Authors: Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai
Ding, Fengjun Guo, Lianwen Jin
- Abstract summary: We propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface.
Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder.
Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection.
- Score: 36.966005829678124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the optical character recognition (OCR) field has been
proliferating with plentiful cutting-edge approaches for a wide spectrum of
tasks. However, these approaches are task-specifically designed with divergent
paradigms, architectures, and training strategies, which significantly
increases the complexity of research and maintenance and hinders the fast
deployment in applications. To this end, we propose UPOCR, a
simple-yet-effective generalist model for Unified Pixel-level OCR interface.
Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as
image-to-image transformation and the architecture as a vision Transformer
(ViT)-based encoder-decoder. Learnable task prompts are introduced to push the
general feature representations extracted by the encoder toward task-specific
spaces, endowing the decoder with task awareness. Moreover, the model training
is uniformly aimed at minimizing the discrepancy between the generated and
ground-truth images regardless of the inhomogeneity among tasks. Experiments
are conducted on three pixel-level OCR tasks including text removal, text
segmentation, and tampered text detection. Without bells and whistles, the
experimental results showcase that the proposed method can simultaneously
achieve state-of-the-art performance on three tasks with a unified single
model, which provides valuable strategies and insights for future research on
generalist OCR models. Code will be publicly available.
Related papers
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z) - DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multi- fonts, mixed scenes and complex layouts seriously affect the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z) - Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering [8.382903851560595]
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content.
Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems.
We propose a multimodal adversarial training architecture with spatial awareness capabilities.
arXiv Detail & Related papers (2024-03-14T11:22:06Z) - Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs)
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z) - Low-Resolution Self-Attention for Semantic Segmentation [96.81482872022237]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost.
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
arXiv Detail & Related papers (2023-10-08T06:10:09Z) - C3-STISR: Scene Text Image Super-resolution with Triple Clues [22.41802601665541]
Scene text image super-resolution (STISR) has been regarded as an important pre-processing task for text recognition.
Most recent approaches use the recognizer's feedback as clues to guide super-resolution.
We present a novel method C3-STISR that jointly exploits the recognizer's feedback, visual and linguistical information as clues to guide super-resolution.
arXiv Detail & Related papers (2022-04-29T12:39:51Z) - SeqTR: A Simple yet Universal Network for Visual Grounding [88.03253818868204]
We propose a simple yet universal network termed SeqTR for visual grounding tasks.
We cast visual grounding as a point prediction problem conditioned on image and text inputs.
Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads.
arXiv Detail & Related papers (2022-03-30T12:52:46Z) - Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel VDU model that is end-to-end trainable without underpinning OCR framework.
Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z) - TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG)
arXiv Detail & Related papers (2021-11-16T09:10:39Z) - Hierarchical Deep CNN Feature Set-Based Representation Learning for
Robust Cross-Resolution Face Recognition [59.29808528182607]
Cross-resolution face recognition (CRFR) is important in intelligent surveillance and biometric forensics.
Existing shallow learning-based and deep learning-based methods focus on mapping the HR-LR face pairs into a joint feature space.
In this study, we desire to fully exploit the multi-level deep convolutional neural network (CNN) feature set for robust CRFR.
arXiv Detail & Related papers (2021-03-25T14:03:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.