Digitization of Document and Information Extraction using OCR
- URL: http://arxiv.org/abs/2506.11156v1
- Date: Wed, 11 Jun 2025 16:03:01 GMT
- Title: Digitization of Document and Information Extraction using OCR
- Authors: Rasha Sinha, Rekha B S
- Abstract summary: This document presents a framework for text extraction that merges Optical Character Recognition (OCR) techniques with Large Language Models (LLMs). Scanned files are processed using OCR engines, while digital files are interpreted through layout-aware libraries. The extracted raw text is then analyzed by an LLM to identify key-value pairs and resolve ambiguities.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character Recognition (OCR) techniques with Large Language Models (LLMs) to deliver structured outputs enriched by contextual understanding and confidence indicators. Scanned files are processed using OCR engines, while digital files are interpreted through layout-aware libraries. The extracted raw text is subsequently analyzed by an LLM to identify key-value pairs and resolve ambiguities. A comparative analysis of different OCR tools is presented to evaluate their effectiveness concerning accuracy, layout recognition, and processing speed. The approach demonstrates significant improvements over traditional rule-based and template-based methods, offering enhanced flexibility and semantic precision across different document categories.
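The described flow can be pictured with a minimal Python sketch. It assumes pytesseract as the OCR engine and pdfplumber as the layout-aware library, and uses a hypothetical `query_llm` helper as a placeholder for whichever LLM endpoint is actually called; it illustrates the routing and key-value step, not the authors' implementation.

```python
import json

import pdfplumber          # layout-aware extraction for native digital PDFs
import pytesseract         # wrapper around the Tesseract OCR engine
from PIL import Image


def extract_raw_text(path: str) -> str:
    """Route a file to OCR or layout-aware parsing based on its format."""
    if path.lower().endswith((".png", ".jpg", ".jpeg", ".tiff")):
        # Scanned image: run the OCR engine.
        return pytesseract.image_to_string(Image.open(path))
    # Native digital PDF: read it with a layout-aware library.
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def query_llm(prompt: str) -> str:
    """Placeholder: plug in whichever LLM client the deployment uses."""
    raise NotImplementedError


def extract_key_values(raw_text: str) -> dict:
    """Ask the LLM for key-value pairs with per-field confidence scores."""
    prompt = (
        "Return the key-value pairs in the document below as JSON, adding a "
        "confidence score between 0 and 1 for every field.\n\n" + raw_text
    )
    return json.loads(query_llm(prompt))
```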
Related papers
- Words as Geometric Features: Estimating Homography using Optical Character Recognition as Compressed Image Representation [6.385732495789276]
Document alignment plays a crucial role in numerous real-world applications, such as automated form processing, anomaly detection, and workflow automation. Traditional methods for document alignment rely on image-based features like keypoints, edges, and textures to estimate geometric transformations, such as homographies. This paper introduces a novel approach that leverages Optical Character Recognition (OCR) outputs as features for homography estimation.
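As an illustration of the idea (not the paper's exact pipeline), the sketch below matches OCR word boxes between two views of a document by their recognized strings and estimates a homography from the matched box centres, assuming pytesseract and OpenCV.

```python
import cv2
import numpy as np
import pytesseract
from pytesseract import Output


def word_centres(image) -> dict:
    """Map each recognized word string to the centre of its bounding box."""
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    centres = {}
    for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                                data["width"], data["height"]):
        word = text.strip()
        if word:                                   # keep non-empty recognitions only
            centres.setdefault(word, (x + w / 2.0, y + h / 2.0))
    return centres


def estimate_homography(img_a, img_b):
    a, b = word_centres(img_a), word_centres(img_b)
    common = [w for w in a if w in b]              # words recognized in both images
    src = np.float32([a[w] for w in common])
    dst = np.float32([b[w] for w in common])
    # Needs at least four correspondences; RANSAC treats OCR mismatches as outliers.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H
```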
arXiv Detail & Related papers (2025-05-25T01:20:32Z) - TFIC: End-to-End Text-Focused Image Compression for Coding for Machines [50.86328069558113]
We present an image compression system designed to retain text-specific features for subsequent Optical Character Recognition (OCR). Our encoding process requires half the time needed by the OCR module, making it especially suitable for devices with limited computational capacity.
arXiv Detail & Related papers (2025-03-25T09:36:13Z) - Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents [7.358946120326249]
We introduce Éclair, a text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics.
arXiv Detail & Related papers (2025-02-06T17:07:22Z) - UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z) - Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format. We first craft Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide retrieval, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10.
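A much-simplified sketch of the screenshot-as-document idea follows, using an off-the-shelf CLIP encoder from sentence-transformers as a stand-in for the bi-encoder trained in the paper; the file names and query are illustrative only.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Encode page screenshots once, offline, as the retrieval corpus.
screenshots = [Image.open(p) for p in ["page_001.png", "page_002.png"]]
corpus_emb = model.encode(screenshots, convert_to_tensor=True)

# Encode the question and rank pages by cosine similarity.
query_emb = model.encode("Who founded the company?", convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]
best_page = int(scores.argmax())
print(f"Top-1 page: {best_page}, score: {float(scores[best_page]):.3f}")
```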
arXiv Detail & Related papers (2024-06-17T06:27:35Z) - Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z) - Information Extraction from Unstructured data using Augmented-AI and Computer Vision [0.0]
This paper presents a framework for information extraction that combines Augmented Intelligence (A2I) with computer vision and natural language processing techniques. Our approach addresses the limitations of conventional methods by leveraging deep learning architectures for object detection. The proposed methodology demonstrates improved accuracy and efficiency in extracting structured information from diverse document formats.
arXiv Detail & Related papers (2023-12-15T15:27:41Z) - Optimization of Image Processing Algorithms for Character Recognition in
Cultural Typewritten Documents [0.8158530638728501]
This paper evaluates the impact of image processing methods and parameter tuning on Optical Character Recognition (OCR).
The approach uses a multi-objective problem formulation to minimize the Levenshtein edit distance and maximize the number of correctly identified words, using a non-dominated sorting genetic algorithm (NSGA-II).
Our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results.
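A hedged sketch of the two objectives such a search would evaluate for each candidate pre-processing setting: the edit distance to the ground-truth transcription (minimized) and the number of correctly identified words (negated, since NSGA-II implementations typically minimize). The word-matching here is deliberately simplified and not the paper's exact scoring.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def objectives(ocr_text: str, ground_truth: str) -> tuple:
    """Return the pair of values NSGA-II would minimize for one candidate."""
    correct_words = len(set(ocr_text.split()) & set(ground_truth.split()))
    return levenshtein(ocr_text, ground_truth), -correct_words
```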
arXiv Detail & Related papers (2023-11-27T11:44:46Z) - mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document
Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z) - Information Extraction from Scanned Invoice Images using Text Analysis
and Layout Features [0.0]
OCRMiner is designed to process documents in the way a human reader does, i.e., by employing different layout and text attributes in a coordinated decision.
The system recovers the invoice data in 90% of cases for the English set and 88% for the Czech set.
arXiv Detail & Related papers (2022-08-08T09:46:33Z) - Detection Masking for Improved OCR on Noisy Documents [8.137198664755596]
We present an improved detection network with a masking system to improve the quality of OCR performed on documents.
We perform a unified evaluation on a publicly available dataset demonstrating the usefulness and broad applicability of our method.
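The paper's detector is a trained network, but the masking step itself can be pictured with a short, hypothetical helper: given a binary mask of detected text regions from any detector, everything outside those regions is blanked out before the image reaches the OCR engine.

```python
import numpy as np
import pytesseract


def ocr_with_mask(image: np.ndarray, text_mask: np.ndarray) -> str:
    """OCR a grayscale image after masking out non-text (noisy) regions.

    `text_mask` is a uint8 array of the same shape, 255 on detected text
    regions and 0 elsewhere; its origin (the detection network) is out of
    scope for this sketch.
    """
    cleaned = np.full_like(image, 255)               # start from a white page
    cleaned[text_mask > 0] = image[text_mask > 0]    # keep only text pixels
    return pytesseract.image_to_string(cleaned)
```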
arXiv Detail & Related papers (2022-05-17T11:59:18Z) - TRIE: End-to-End Text Reading and Information Extraction for Document
Understanding [56.1416883796342]
We propose a unified end-to-end text reading and information extraction network.
Multimodal visual and textual features from text reading are fused for information extraction.
Our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
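The following is not the actual TRIE architecture, only a minimal PyTorch sketch of the fusion idea: per-token visual and textual features from the text-reading branch are concatenated and projected before an information-extraction head predicts field labels.

```python
import torch
import torch.nn as nn


class SimpleFusion(nn.Module):
    """Concatenate visual and textual token features, then classify fields."""

    def __init__(self, visual_dim: int, text_dim: int,
                 hidden_dim: int, num_labels: int):
        super().__init__()
        self.fuse = nn.Linear(visual_dim + text_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_labels)   # per-token field labels

    def forward(self, visual_feats: torch.Tensor,
                text_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, tokens, visual_dim); text_feats: (batch, tokens, text_dim)
        fused = torch.relu(self.fuse(torch.cat([visual_feats, text_feats], dim=-1)))
        return self.head(fused)                         # logits over field types
```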
arXiv Detail & Related papers (2020-05-27T01:47:26Z)