Text Reading Order in Uncontrolled Conditions by Sparse Graph
Segmentation
- URL: http://arxiv.org/abs/2305.02577v1
- Date: Thu, 4 May 2023 06:21:00 GMT
- Title: Text Reading Order in Uncontrolled Conditions by Sparse Graph
Segmentation
- Authors: Renshen Wang, Yasuhisa Fujii and Alessandro Bissacco
- Abstract summary: We propose a lightweight, scalable and generalizable approach to identify text reading order.
The model is language-agnostic and runs effectively across multi-language datasets.
It is small enough to be deployed on virtually any platform including mobile devices.
- Score: 71.40119152422295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text reading order is a crucial aspect in the output of an OCR engine, with a
large impact on downstream tasks. Its difficulty lies in the large variation of
domain specific layout structures, and is further exacerbated by real-world
image degradations such as perspective distortions. We propose a lightweight,
scalable and generalizable approach to identify text reading order with a
multi-modal, multi-task graph convolutional network (GCN) running on a sparse
layout based graph. Predictions from the model provide hints of bidimensional
relations among text lines and layout region structures, upon which a
post-processing cluster-and-sort algorithm generates an ordered sequence of all
the text lines. The model is language-agnostic and runs effectively across
multi-language datasets that contain various types of images taken in
uncontrolled conditions, and it is small enough to be deployed on virtually any
platform including mobile devices.
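The abstract names a cluster-and-sort post-processing step but gives no details, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes, hypothetically, that the model's output has already been reduced to pairs of text lines predicted to share a layout region, and that each line is represented by its top-left `(x, y)` coordinate. The function name `cluster_and_sort`, its parameters, and the left-to-right, top-to-bottom ordering rule are all illustrative assumptions.

```python
from collections import defaultdict


def cluster_and_sort(lines, same_region_edges):
    """Order text lines by grouping them into regions, then sorting.

    lines: list of (x, y) top-left coordinates, one per text line (assumed input).
    same_region_edges: pairs (i, j) of line indices that a model is assumed
        to have predicted as belonging to the same layout region.
    Returns a list of line indices in reading order.
    """
    parent = list(range(len(lines)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Cluster: merge lines connected by a predicted same-region edge.
    for i, j in same_region_edges:
        parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for idx in range(len(lines)):
        clusters[find(idx)].append(idx)

    # Sort lines inside each region top-to-bottom, then left-to-right.
    ordered = [
        sorted(members, key=lambda k: (lines[k][1], lines[k][0]))
        for members in clusters.values()
    ]

    # Sort regions left-to-right, then top-to-bottom (an assumed column-wise rule).
    ordered.sort(
        key=lambda c: (min(lines[k][0] for k in c), min(lines[k][1] for k in c))
    )
    return [k for region in ordered for k in region]


# Two side-by-side columns with two lines each.
two_columns = [(0, 0), (100, 0), (0, 10), (100, 10)]
print(cluster_and_sort(two_columns, [(0, 2), (1, 3)]))
```

In this two-column example, sorting all lines purely by vertical position would interleave the columns, whereas grouping first keeps each column contiguous. A real system would need ordering rules that handle perspective distortion and other layout patterns, which this sketch ignores.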
Related papers
- SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance [46.77060502803466]
We introduce the Scene Graph Adapter (SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings.
The SG-Adapter's explicit and non-fully connected graph representation greatly improves the fully connected, transformer-based text representations.
arXiv Detail & Related papers (2024-05-24T08:00:46Z)
- Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions [22.090074821554754]
We propose a self-supervised scene text segmentation algorithm that decouples object-centric representations into layers to segment images into text and background.
On several public scene text datasets, our method outperforms the state-of-the-art unsupervised segmentation algorithms.
arXiv Detail & Related papers (2023-08-25T05:00:05Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis [6.155943751502232]
We present a language-independent graph neural network (GNN)-based model that achieves competitive results on common document layout datasets.
Our model is suitable for industrial applications, particularly in multi-language scenarios.
arXiv Detail & Related papers (2023-04-24T03:54:48Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- StrokeNet: Stroke Assisted and Hierarchical Graph Reasoning Networks [31.76016966100244]
StrokeNet is proposed to detect text effectively by capturing fine-grained strokes.
Different from existing approaches that represent the text area by a series of points or rectangular boxes, we directly localize strokes of each text instance.
arXiv Detail & Related papers (2021-11-23T08:26:42Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
The visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)