Related papers: Automated Invoice Data Extraction: Using LLM and OCR

URL: http://arxiv.org/abs/2511.05547v1
Date: Sat, 01 Nov 2025 19:05:09 GMT
Title: Automated Invoice Data Extraction: Using LLM and OCR
Authors: Advait Thakur, Khushi Khanchandani, Akshita Shetty, Chaitravi Reddy, Ritisa Behera
Abstract summary: This work introduces a holistic Artificial Intelligence (AI) platform combining OCR, deep learning, Large Language Models (LLMs) and graph analytics to achieve unprecedented extraction quality and consistency.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conventional Optical Character Recognition (OCR) systems are challenged by variant invoice layouts, handwritten text, and low-quality scans, which are often caused by strong template dependencies that restrict their flexibility across different document structures and layouts. Newer solutions utilize advanced deep learning models such as Convolutional Neural Networks (CNN) as well as Transformers, and domain-specific models for better layout analysis and accuracy across various sections over varied document types. Large Language Models (LLMs) have revolutionized extraction pipelines at their core with sophisticated entity recognition and semantic comprehension to support complex contextual relationship mapping without direct programming specification. Visual Named Entity Recognition (NER) capabilities permit extraction from invoice images with greater contextual sensitivity and much higher accuracy rates than older approaches. Existing industry best practices utilize hybrid architectures that blend OCR technology and LLM for maximum scalability and minimal human intervention. This work introduces a holistic Artificial Intelligence (AI) platform combining OCR, deep learning, LLMs, and graph analytics to achieve unprecedented extraction quality and consistency.

Related papers

FireRed-OCR Technical Report [30.019999826760003]
We introduce FireRed-OCR, a framework to transform general-purpose VLMs into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. We propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation.
arXiv Detail & Related papers (2026-03-02T13:19:23Z)
FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks [9.040003496268314]
We propose an automated pipeline that extracts well-formed Question-Answer (QA) pairs from educational documents. Experiments show that the method produces accurate, aligned, and low-noise QA/VQA pairs.
arXiv Detail & Related papers (2025-11-20T10:38:00Z)
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
Generating Synthetic Invoices via Layout-Preserving Content Replacement [0.0]
We present a novel pipeline for generating high-fidelity, synthetic invoice documents and their corresponding structured data. Our approach provides a scalable and automated solution to amplify small, private datasets.
arXiv Detail & Related papers (2025-08-04T06:19:34Z)
A Lightweight Multi-Module Fusion Approach for Korean Character Recognition [0.0]
SDA-Net is a lightweight and efficient architecture for robust single-character recognition. It achieves state-of-the-art accuracy on challenging OCR benchmarks, with significantly faster inference.
arXiv Detail & Related papers (2025-04-08T07:50:19Z)
VISTA-OCR: Towards generative and interactive end to end OCR models [3.7548609506798494]
VISTA-OCR is a lightweight architecture that unifies text detection and recognition within a single generative model. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase. To enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples.
arXiv Detail & Related papers (2025-04-04T17:39:53Z)
QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder. We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
SOLO: A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single Transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
Mixed Text Recognition with Efficient Parameter Fine-Tuning and Transformer [12.966765239586994]
This paper proposes DLoRA-TrOCR, a parameter-efficient hybrid text spotting method based on a pre-trained OCR Transformer. By embedding a weight-decomposed DoRA module in the image encoder and a LoRA module in the text decoder, this method can be efficiently fine-tuned on various downstream tasks. Experiments show that our proposed DLoRA-TrOCR outperforms other parameter-efficient fine-tuning methods in recognizing complex scenes with mixed handwritten, printed, and street text.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as a web page. Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs). We compare their accuracy and performance on widely used public datasets of scene and handwritten text. Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.