Information Extraction from Scanned Invoice Images using Text Analysis
and Layout Features
- URL: http://arxiv.org/abs/2208.04011v1
- Date: Mon, 8 Aug 2022 09:46:33 GMT
- Title: Information Extraction from Scanned Invoice Images using Text Analysis
and Layout Features
- Authors: Hien Thi Ha and Ale\v{s} Hor\'ak
- Abstract summary: OCRMiner is designed to process documents in a similar way a human reader uses, i.e. to employ different layout and text attributes in a coordinated decision.
The system is able to recover the invoice data in 90% for English and in 88% for the Czech set.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While storing invoice content as metadata to avoid paper document processing
may be the future trend, almost all of daily issued invoices are still printed
on paper or generated in digital formats such as PDFs. In this paper, we
introduce the OCRMiner system for information extraction from scanned document
images which is based on text analysis techniques in combination with layout
features to extract indexing metadata of (semi-)structured documents. The
system is designed to process the document in a similar way a human reader
uses, i.e. to employ different layout and text attributes in a coordinated
decision. The system consists of a set of interconnected modules that start
with (possibly erroneous) character-based output from a standard OCR system and
allow to apply different techniques and to expand the extracted knowledge at
each step. Using an open source OCR, the system is able to recover the invoice
data in 90% for English and in 88% for the Czech set.
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first craft the dataset of Wiki-SS, a 1.3M Wikipedia web page screenshots as the corpus to answer the questions from the Natural Questions dataset.
In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z) - Enhancement of Bengali OCR by Specialized Models and Advanced Techniques
for Diverse Document Types [1.2499537119440245]
This research paper presents a unique Bengali OCR system with some capabilities.
The system excels in reconstructing document layouts while preserving structure, alignment, and images.
Specialized models for word segmentation cater to diverse document types, including computer-composed, letterpress, typewriter, and handwritten documents.
arXiv Detail & Related papers (2024-02-07T18:02:33Z) - Automatic Recognition of Learning Resource Category in a Digital Library [6.865460045260549]
We introduce the Heterogeneous Learning Resources (HLR) dataset designed for document image classification.
The approach involves decomposing individual learning resources into constituent document images (sheets)
These images are then processed through an OCR tool to extract textual representation.
arXiv Detail & Related papers (2023-11-28T07:48:18Z) - mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document
Understanding [55.4806974284156]
Document understanding refers to automatically extract, analyze and comprehend information from digital documents, such as a web page.
Existing Multi-model Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z) - Document Layout Annotation: Database and Benchmark in the Domain of
Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z) - Layout-Aware Information Extraction for Document-Grounded Dialogue:
Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z) - Combining Deep Learning and Reasoning for Address Detection in
Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents.
We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
arXiv Detail & Related papers (2022-02-07T12:32:00Z) - Abstractive Information Extraction from Scanned Invoices (AIESI) using
End-to-end Sequential Approach [0.0]
We are interested in data like, Payee name, total amount, address, and etc.
Extracted information helps to get complete insight of data, which can be helpful for fast document searching, efficient indexing in databases, data analytics, and etc.
In this paper we proposed an improved method to ensemble all visual and textual features from invoices to extract key invoice parameters using Word wise BiLSTM.
arXiv Detail & Related papers (2020-09-12T05:14:28Z) - TRIE: End-to-End Text Reading and Information Extraction for Document
Understanding [56.1416883796342]
We propose a unified end-to-end text reading and information extraction network.
multimodal visual and textual features of text reading are fused for information extraction.
Our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
arXiv Detail & Related papers (2020-05-27T01:47:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.