Related papers: UPOCR: Towards Unified Pixel-Level OCR Interface

UPOCR: Towards Unified Pixel-Level OCR Interface

URL: http://arxiv.org/abs/2312.02694v1
Date: Tue, 5 Dec 2023 11:53:17 GMT
Title: UPOCR: Towards Unified Pixel-Level OCR Interface
Authors: Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai Ding, Fengjun Guo, Lianwen Jin
Abstract summary: We propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection.
Score: 36.966005829678124
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, the optical character recognition (OCR) field has been proliferating with plentiful cutting-edge approaches for a wide spectrum of tasks. However, these approaches are task-specifically designed with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the generated and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code will be publicly available.

Related papers

DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models [9.109484087832058]
DiffRIS is a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for RRSIS tasks.<n>Our framework introduces two key innovations: a context perception adapter (CP-adapter) and a cross-modal reasoning decoder (PCMRD)
arXiv Detail & Related papers (2025-06-23T02:38:56Z)
VISTA-OCR: Towards generative and interactive end to end OCR models [3.7548609506798494]
VISTA-OCR is a lightweight architecture that unifies text detection and recognition within a single generative model. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase. To enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples.
arXiv Detail & Related papers (2025-04-04T17:39:53Z)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce Omni V2, a universal model that unifies VsTP typical tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Zero-Shot Composed Image Retrieval (ZS-CIR) is designed to retrieve target images using bi-modal (image+text) queries.<n>We propose a lightweight post-hoc framework, consisting of two components.<n>Experiments show that, with the incorporation of our proposed components, inversion-based methods achieve significant improvements.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects [31.926206783846144]
We show that a Vision Transformer (ViT) fails dramatically on most ARC tasks even when trained on one million examples per task. We propose ViTARC, a ViT-style architecture that unlocks some of the visual reasoning capabilities required by the ARC. Our task-specific ViTARC models achieve a test solve rate close to 100% on more than half of the 400 public ARC tasks.
arXiv Detail & Related papers (2024-10-08T22:25:34Z)
UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model. We show that UNIT significantly outperforms existing methods on document-related tasks. Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval [34.065449743428005]
Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image searches. Traditional Zero-Shot (ZS) CIR methods bypass the need for expensive training CIR triplets by projecting image embeddings into the text token embedding space. We introduce Reducing Taskrepancy of Texts (RTD), an efficient text-only framework that complements projection-based CIR methods.
arXiv Detail & Related papers (2024-06-13T14:49:28Z)
Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image. We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multi- fonts, mixed scenes and complex layouts seriously affect the recognition accuracy of traditional OCR models. We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering [8.382903851560595]
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems. We propose a multimodal adversarial training architecture with spatial awareness capabilities.
arXiv Detail & Related papers (2024-03-14T11:22:06Z)
Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. Recent research sidesteps this need by using large-scale vision-language models (VLMs) We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
Low-Resolution Self-Attention for Semantic Segmentation [96.81482872022237]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost. Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
arXiv Detail & Related papers (2023-10-08T06:10:09Z)
SeqTR: A Simple yet Universal Network for Visual Grounding [88.03253818868204]
We propose a simple yet universal network termed SeqTR for visual grounding tasks. We cast visual grounding as a point prediction problem conditioned on image and text inputs. Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads.
arXiv Detail & Related papers (2022-03-30T12:52:46Z)
Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel VDU model that is end-to-end trainable without underpinning OCR framework. Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z)
TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text. Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy. We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG)
arXiv Detail & Related papers (2021-11-16T09:10:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.