Binarizing Documents by Leveraging both Space and Frequency
- URL: http://arxiv.org/abs/2404.17243v1
- Date: Fri, 26 Apr 2024 08:31:10 GMT
- Title: Binarizing Documents by Leveraging both Space and Frequency
- Authors: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
- Abstract summary: Document Image Binarization is a well-known problem in Document Analysis and Computer Vision.
We propose an alternative solution based on the recently introduced Fast Fourier Convolutions.
- Score: 33.334956022229846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. One of the main challenges of this task is that documents generally exhibit degradations and acquisition artifacts that can greatly vary throughout the page. Nonetheless, even when dealing with a local patch of the document, taking into account the overall appearance of a wide portion of the page can ease the prediction by enriching it with semantic information on the ink and background conditions. In this respect, approaches able to model both local and global information have been proven suitable for this task. In particular, recent applications of Vision Transformer (ViT)-based models, able to model short and long-range dependencies via the attention mechanism, have demonstrated their superiority over standard Convolution-based models, which instead struggle to model global dependencies. In this work, we propose an alternative solution based on the recently introduced Fast Fourier Convolutions, which overcomes the limitation of standard convolutions in modeling global information while requiring fewer parameters than ViTs. We validate the effectiveness of our approach via extensive experimental analysis considering different types of degradations.
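The key mechanism referenced in the abstract, the Fast Fourier Convolution (FFC), pairs a standard local convolution with a spectral branch that applies a pointwise convolution to the Fourier transform of the feature map, so every output position sees the whole patch. The PyTorch sketch below illustrates that idea only; the module names, channel arrangement, and the way the two branches are fused are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SpectralBlock(nn.Module):
    """Sketch of the spectral (global) branch of a Fast Fourier Convolution."""

    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis, hence 2 * channels.
        self.freq_conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Real 2D FFT over the spatial dimensions -> complex tensor of shape (b, c, h, w // 2 + 1).
        freq = torch.fft.rfft2(x, norm="ortho")
        # Mix real and imaginary parts across channels with a 1x1 convolution:
        # a pointwise operation in the frequency domain acts globally in the spatial domain.
        freq = self.freq_conv(torch.cat([freq.real, freq.imag], dim=1))
        real, imag = torch.chunk(freq, 2, dim=1)
        # Inverse FFT back to the spatial domain at the original resolution.
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")


class FFCBlock(nn.Module):
    """Simplified FFC-style block: a local 3x3 branch plus the global spectral branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.local_branch = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.global_branch = SpectralBlock(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.local_branch(x) + self.global_branch(x)
```

Because the 1x1 convolution operates on Fourier coefficients, every output pixel of the spectral branch depends on the entire input patch, which is the global context the abstract contrasts with the limited receptive field of standard convolutions.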
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Recurrent Few-Shot model for Document Verification [1.9686770963118383]
Image- and video-based verification systems for general-purpose ID and travel documents have yet to achieve good enough performance to be considered a solved problem.
We propose a recurrent-based model able to detect forged documents in a few-shot scenario.
Preliminary results on the SIDTD and Findit datasets show good performance of this model for this task.
arXiv Detail & Related papers (2024-10-03T13:05:27Z)
- Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture [58.60915132222421]
We introduce an approach that is both general and parameter-efficient for face forgery detection.
We design a forgery-style mixture formulation that augments the diversity of forgery source domains.
We show that the designed model achieves state-of-the-art generalizability with significantly reduced trainable parameters.
arXiv Detail & Related papers (2024-08-23T01:53:36Z)
- Robustness of Structured Data Extraction from In-plane Rotated Documents using Multi-Modal Large Language Models (LLM) [0.0]
This study investigates the impact of document skew on the data extraction accuracy of three state-of-the-art multi-modal models.
We identify the safe in-plane rotation angles (SIPRA) for each model and investigate the effects of skew on model hallucinations.
arXiv Detail & Related papers (2024-06-13T08:55:01Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification [8.880856137902947]
We introduce GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner.
GlobalDoc improves the learning of richer semantic concepts by unifying language and visual representations.
For proper evaluation, we also propose two novel document-level downstream VDU tasks: Few-Shot Document Image Classification (DIC) and Content-based Document Image Retrieval (DIR).
arXiv Detail & Related papers (2023-09-11T18:35:14Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- GVdoc: Graph-based Visual Document Classification [17.350393956461783]
We propose GVdoc, a graph-based document classification model.
Our approach generates a document graph based on its layout, and then trains a graph neural network to learn node and graph embeddings (an illustrative sketch follows this entry).
We show that our model, even with fewer parameters, outperforms state-of-the-art models on out-of-distribution data.
arXiv Detail & Related papers (2023-05-26T19:23:20Z)
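The GVdoc entry above describes building a graph from the page layout and training a graph neural network on it. The sketch below only illustrates that general pipeline under assumed details (nodes are OCR tokens connected to their spatial nearest neighbours, message passing is plain mean aggregation); the actual GVdoc graph construction and architecture differ.

```python
import torch
import torch.nn as nn


def layout_knn_adjacency(centers: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Row-normalised k-nearest-neighbour adjacency built from (x, y) box centres of shape (n, 2)."""
    dist = torch.cdist(centers, centers)              # pairwise Euclidean distances between boxes
    knn = dist.topk(k + 1, largest=False).indices     # k nearest neighbours plus the box itself
    adj = torch.zeros_like(dist)
    adj.scatter_(1, knn, 1.0)
    adj = torch.maximum(adj, adj.t())                 # make the graph undirected
    return adj / adj.sum(dim=1, keepdim=True)


class SimpleLayoutGraphClassifier(nn.Module):
    """Two rounds of mean-aggregation message passing, then mean pooling to a page-level prediction."""

    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.msg1 = nn.Linear(in_dim, hidden_dim)
        self.msg2 = nn.Linear(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.msg1(adj @ node_feats))   # aggregate neighbour features, then transform
        h = torch.relu(self.msg2(adj @ h))            # second hop widens each node's context
        return self.head(h.mean(dim=0))               # pooled graph embedding -> document class
```

Here `node_feats` would hold per-token features (textual, visual, or both), and the pooled representation plays the role of the learned graph embedding used for classification.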
- TFS-ViT: Token-Level Feature Stylization for Domain Generalization [17.82872117103924]
Vision Transformers (ViTs) have shown outstanding performance for a broad range of computer vision tasks.
This paper presents a first Token-level Feature Stylization (TFS-ViT) approach for domain generalization.
Our approach transforms token features by mixing the normalization statistics of images from different domains (a minimal sketch of this idea follows this entry).
arXiv Detail & Related papers (2023-03-28T03:00:28Z)
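The token-level feature stylization summarised in the TFS-ViT entry above amounts to re-normalising each image's token features and re-scaling them with statistics interpolated between images from different domains. The sketch below assumes per-image statistics over the token dimension and a Beta-distributed mixing coefficient; the paper's exact formulation may differ.

```python
import torch


def token_feature_stylization(x: torch.Tensor, alpha: float = 0.1, eps: float = 1e-6) -> torch.Tensor:
    """Mix token-feature statistics across a batch of ViT features x of shape (batch, tokens, dim).

    Each image's tokens are standardised with its own mean/std, then re-scaled with statistics
    interpolated between its own and those of a randomly paired image (ideally from another domain).
    """
    mu = x.mean(dim=1, keepdim=True)             # per-image mean over tokens: (b, 1, d)
    sigma = x.std(dim=1, keepdim=True) + eps     # per-image std over tokens:  (b, 1, d)
    normed = (x - mu) / sigma

    perm = torch.randperm(x.size(0), device=x.device)                 # random pairing within the batch
    lam = torch.distributions.Beta(alpha, alpha).sample((x.size(0), 1, 1)).to(x.device)
    mixed_mu = lam * mu + (1.0 - lam) * mu[perm]
    mixed_sigma = lam * sigma + (1.0 - lam) * sigma[perm]
    return normed * mixed_sigma + mixed_mu
```

Applied during training, this augmentation exposes the network to novel style statistics without changing token content, which is the domain-generalization effect the entry describes.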
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)