CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset
- URL: http://arxiv.org/abs/2406.04493v1
- Date: Thu, 6 Jun 2024 20:38:15 GMT
- Title: CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset
- Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Ibrahim Abdelhalim, Mohamed Elkasaby, Yasser ElBendary, Adam Jatowt
- Abstract summary: This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding dataset (CORU).
CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores.
We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods.
- Score: 12.828786692835369
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing. Our datasets are publicly accessible (https://github.com/Update-For-Integrated-Business-AI/CORU).
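As an illustration of the fields the abstract lists (merchant names, item descriptions, total prices, receipt numbers, and dates), the following is a minimal sketch of how one parsed receipt might be held in memory. It is not the CORU annotation schema: the class names, field names, the per-item price field, and the consistency check are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReceiptItem:
    # Hypothetical item record; the per-item price is an assumption,
    # the abstract only names "item descriptions".
    description: str
    price: Optional[float] = None


@dataclass
class ReceiptRecord:
    # Hypothetical container for the fields named in the abstract.
    merchant_name: Optional[str] = None
    receipt_number: Optional[str] = None
    date: Optional[str] = None
    total_price: Optional[float] = None
    items: List[ReceiptItem] = field(default_factory=list)


def items_total(record: ReceiptRecord) -> float:
    """Sum the prices of all items that have a price."""
    return round(sum(i.price for i in record.items if i.price is not None), 2)


def total_is_consistent(record: ReceiptRecord, tol: float = 0.01) -> bool:
    """Check whether the extracted total matches the sum of item prices."""
    if record.total_price is None:
        return False
    return abs(record.total_price - items_total(record)) <= tol


# Usage with made-up values; receipts may mix Arabic and English text.
example = ReceiptRecord(
    merchant_name="Example Market",
    total_price=15.5,
    items=[ReceiptItem("خبز", 5.5), ReceiptItem("Milk", 10.0)],
)
print(total_is_consistent(example))
```

A check of this kind is one possible sanity test on noisy OCR output, since a misread digit in either an item price or the total makes the two disagree.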
Related papers
- Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation [33.66939971907121]
The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin contents, but struggles with multilingual scenarios and complex tasks.
In general, despite its versatility in handling diverse OCR tasks, GPT-4V does not outperform existing state-of-the-art OCR models.
arXiv Detail & Related papers (2023-10-25T17:38:55Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- AMuRD: Annotated Arabic-English Receipt Dataset for Key Information Extraction and Classification [14.386767741945256]
AMuRD is a novel multilingual human-annotated dataset specifically designed for information extraction from receipts.
Each sample includes annotations for item names and attributes such as price, brand, and more.
This detailed annotation facilitates a comprehensive understanding of each item on the receipt.
arXiv Detail & Related papers (2023-09-18T14:18:19Z)
- bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents [0.23639235997306196]
We introduce Bengali.AI-BRACU-OCR (bbOCR), an open-source scalable document OCR system that can reconstruct Bengali documents into a structured searchable digitized format.
Our proposed solution is preferable over the current state-of-the-art Bengali OCR systems.
arXiv Detail & Related papers (2023-08-21T11:35:28Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
- ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction [70.71240097723745]
In recognition of the technical challenges, importance and huge commercial potentials of SROIE, we organized the ICDAR 2019 competition on SROIE.
A new dataset with 1000 whole scanned receipt images and annotations is created for the competition.
In this report we present the motivation, competition datasets, task definition, evaluation protocol, submission statistics, performance of submitted methods, and results analysis.
arXiv Detail & Related papers (2021-03-18T12:33:41Z)
- Cross-Lingual Low-Resource Set-to-Description Retrieval for Global E-Commerce [83.72476966339103]
Cross-lingual information retrieval is a new task in cross-border e-commerce.
We propose a novel cross-lingual matching network (CLMN) with the enhancement of context-dependent cross-lingual mapping.
Experimental results indicate that our proposed CLMN yields impressive results on the challenging task.
arXiv Detail & Related papers (2020-05-17T08:10:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.