Related papers: A Fast Fully Octave Convolutional Neural Network for Document Image Segmentation

URL: http://arxiv.org/abs/2004.01317v1
Date: Fri, 3 Apr 2020 00:57:33 GMT
Title: A Fast Fully Octave Convolutional Neural Network for Document Image Segmentation
Authors: Ricardo Batista das Neves Junior, Luiz Felipe Verçosa, David Macêdo, Byron Leite Dantas Bezerra, Cleber Zanchettin
Abstract summary: We investigate a method based on U-Net to detect the document edges and text regions in ID images. We propose a model optimization based on Octave Convolutions to qualify the method to situations where storage, processing, and time resources are limited. Our results showed that the proposed models are efficient to document segmentation tasks and portable.
Score: 1.8426817621478804
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The Know Your Customer (KYC) and Anti Money Laundering (AML) are worldwide practices to online customer identification based on personal identification documents, similarity and liveness checking, and proof of address. To answer the basic regulation question: are you whom you say you are? The customer needs to upload valid identification documents (ID). This task imposes some computational challenges since these documents are diverse, may present different and complex backgrounds, some occlusion, partial rotation, poor quality, or damage. Advanced text and document segmentation algorithms were used to process the ID images. In this context, we investigated a method based on U-Net to detect the document edges and text regions in ID images. Besides the promising results on image segmentation, the U-Net based approach is computationally expensive for a real application, since the image segmentation is a customer device task. We propose a model optimization based on Octave Convolutions to qualify the method to situations where storage, processing, and time resources are limited, such as in mobile and robotic applications. We conducted the evaluation experiments in two new datasets CDPhotoDataset and DTDDataset, which are composed of real ID images of Brazilian documents. Our results showed that the proposed models are efficient to document segmentation tasks and portable.

Related papers

Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark [14.379556287829471]
Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category. We introduce a new Natural Language-based Document Image Retrieval benchmark with corresponding evaluation metrics.
arXiv Detail & Related papers (2025-12-23T09:14:16Z)
Words as Geometric Features: Estimating Homography using Optical Character Recognition as Compressed Image Representation [6.385732495789276]
Document alignment plays a crucial role in numerous real-world applications, such as automated form processing, anomaly detection, and workflow automation. Traditional methods for document alignment rely on image-based features like keypoints, edges, and textures to estimate geometric transformations, such as homographies. This paper introduces a novel approach that leverages Optical Character Recognition (OCR) outputs as features for homography estimation.
arXiv Detail & Related papers (2025-05-25T01:20:32Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
Geometry Restoration and Dewarping of Camera-Captured Document Images [0.0]
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid.
arXiv Detail & Related papers (2025-01-06T17:12:19Z)
LookupForensics: A Large-Scale Multi-Task Dataset for Multi-Phase Image-Based Fact Verification [15.616232457341097]
We call this "image-based automated fact verification," a name that originated from a text-based fact-checking system used by journalists. We present a large-scale dataset tailored for this new task that features various hand-crafted image edits and machine learning-driven manipulations.
arXiv Detail & Related papers (2024-07-26T09:15:29Z)
Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format. We first craft the dataset of Wiki-SS, a 1.3M Wikipedia web page screenshots as the corpus to answer the questions from the Natural Questions dataset. In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement [13.27528507177775]
We propose T2T-BinFormer which is a novel document binarization encoder-decoder architecture based on a Tokens-to-token vision transformer. Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods.
arXiv Detail & Related papers (2023-12-06T23:01:11Z)
DocMAE: Document Image Rectification via Self-supervised Representation Learning [144.44748607192147]
We present DocMAE, a novel self-supervised framework for document image rectification. We first mask random patches of the background-excluded document images and then reconstruct the missing pixels. With such a self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents.
arXiv Detail & Related papers (2023-04-20T14:27:15Z)
Deep Unrestricted Document Image Rectification [110.61517455253308]
We present DocTr++, a novel unified framework for document image rectification. We upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing. We contribute a real-world test set and metrics applicable for evaluating the rectification quality.
arXiv Detail & Related papers (2023-04-18T08:00:54Z)
Zero-Shot In-Distribution Detection in Multi-Object Settings Using Vision-Language Foundation Models [37.36999826208225]
In this paper, we propose a novel problem setting called zero-shot in-distribution (ID) detection. We identify images containing ID objects as ID images (even if they contain OOD objects) and images lacking ID objects as OOD images without any training. We present a simple and effective approach, Global-Local Concept Matching, based on both global and local visual-text alignments of CLIP features.
arXiv Detail & Related papers (2023-04-10T11:35:42Z)
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval consists in finding images related to a given query text or vice-versa. Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. This paper proposes an ALign And DIstill Network (ALADIN) to fill in the gap between effectiveness and efficiency.
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer [16.03084865625318]
Business intelligence processes often require the extraction of useful semantic content from documents. We present a transformer-based model for end-to-end segmentation of complex layouts in document images. Our model achieved comparable or better segmentation performance than the existing state-of-the-art approaches.
arXiv Detail & Related papers (2022-01-27T10:50:22Z)
One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents. We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
ICDAR 2021 Competition on Components Segmentation Task of Document Photos [63.289361617237944]
Three challenge tasks were proposed entailing different segmentation assignments to be performed on a provided dataset. The collected data are from several types of Brazilian ID documents, whose personal information was conveniently replaced. Different Deep Learning models were applied by the entrants with diverse strategies to achieve the best results in each of the tasks.
arXiv Detail & Related papers (2021-06-16T00:49:58Z)
Unsupervised Neural Domain Adaptation for Document Image Binarization [13.848843012433187]
This paper proposes a method that combines neural networks and Domain Adaptation (DA) in order to carry out unsupervised document binarization. Results show that our proposal successfully deals with the binarization of new document domains without the need for labeled data.
arXiv Detail & Related papers (2020-12-02T13:42:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.