AMuRD: Annotated Arabic-English Receipt Dataset for Key Information Extraction and Classification
- URL: http://arxiv.org/abs/2309.09800v3
- Date: Tue, 26 Mar 2024 16:05:51 GMT
- Title: AMuRD: Annotated Arabic-English Receipt Dataset for Key Information Extraction and Classification
- Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt
- Abstract summary: AMuRD is a novel multilingual human-annotated dataset specifically designed for information extraction from receipts.
Each sample includes annotations for item names and attributes such as price, brand, and more.
This detailed annotation facilitates a comprehensive understanding of each item on the receipt.
- Score: 14.386767741945256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The extraction of key information from receipts is a complex task that involves the recognition and extraction of text from scanned receipts. This process is crucial because it enables essential content to be retrieved and organized into structured documents for easy access and analysis. In this paper, we present AMuRD, a novel multilingual human-annotated dataset specifically designed for information extraction from receipts. This dataset comprises 47,720 samples and addresses the key challenges in information extraction and item classification - the two critical aspects of data analysis in the retail industry. Each sample includes annotations for item names and attributes such as price, brand, and more. This detailed annotation facilitates a comprehensive understanding of each item on the receipt. Furthermore, the dataset provides classification into 44 distinct product categories. This classification feature allows for a more organized and efficient analysis of the items, enhancing the usability of the dataset for various applications. In our study, we evaluated various language model architectures, e.g., by fine-tuning LLaMA models on the AMuRD dataset. Our approach yielded exceptional results, with an F1 score of 97.43% and accuracy of 94.99% in information extraction and classification, and an even higher F1 score of 98.51% and accuracy of 97.06% observed in specific tasks. The dataset and code are publicly accessible for further research: https://github.com/Update-For-Integrated-Business-AI/AMuRD.
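The exact annotation schema and prompt format are not given in this summary. The sketch below is a minimal, hypothetical illustration of how a single AMuRD-style item record and an instruction prompt for a fine-tuned LLaMA extractor might look; the field names, the category label, and the prompt template are assumptions, and the actual schema is available in the linked repository.

```python
# Minimal sketch of one AMuRD-style sample and an extraction prompt.
# Field names ("item_name", "price", "brand", "category") and the prompt
# template are assumptions for illustration only; consult the repository
# (https://github.com/Update-For-Integrated-Business-AI/AMuRD) for the
# actual schema used by the dataset.
from dataclasses import dataclass, asdict
import json

@dataclass
class ReceiptItem:
    raw_text: str      # receipt line as scanned (Arabic or English)
    item_name: str     # normalized item name
    price: float       # annotated price attribute
    brand: str         # annotated brand attribute
    category: str      # one of the 44 product categories

def build_extraction_prompt(raw_text: str) -> str:
    """Instruction-style prompt a fine-tuned LLaMA model could be trained on."""
    return (
        "Extract the item name, price, brand, and product category "
        "from the following receipt line and answer in JSON.\n"
        f"Receipt line: {raw_text}\n"
        "Answer:"
    )

# Hypothetical sample record (not taken from the dataset).
sample = ReceiptItem(
    raw_text="نسكافيه جولد 100 جم 45.50",
    item_name="Nescafe Gold 100g",
    price=45.50,
    brand="Nescafe",
    category="Coffee & Tea",
)

print(build_extraction_prompt(sample.raw_text))
print(json.dumps(asdict(sample), ensure_ascii=False, indent=2))
```

In such a setup, the fine-tuned model would be trained to map the prompt to the JSON record; classification into the 44 product categories could be handled either as an additional field of the same record or as a separate task.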
Related papers
- CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset [12.828786692835369]
This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding dataset (CORU).
CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores.
We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods.
arXiv Detail & Related papers (2024-06-06T20:38:15Z)
- Deep Learning Based Named Entity Recognition Models for Recipes [7.507956305171027]
Named entity recognition (NER) is a technique for extracting information from unstructured or semi-structured data with known labels.
We created an augmented dataset of 26,445 phrases cumulatively.
We analyzed ingredient phrases from RecipeDB, the gold-standard recipe data repository, and annotated them using the Stanford NER.
A thorough investigation of NER approaches on these datasets, involving statistical methods and the fine-tuning of deep learning-based language models, provides deep insights.
arXiv Detail & Related papers (2024-02-27T12:03:56Z)
- Distantly Supervised Morpho-Syntactic Model for Relation Extraction [0.27195102129094995]
We present a method for the extraction and categorisation of an unrestricted set of relationships from text.
We evaluate our approach on six datasets built on Wikidata and Wikipedia.
arXiv Detail & Related papers (2024-01-18T14:17:40Z)
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275]
E-commerce platforms require structured product data in the form of attribute-value pairs.
BERT-based extraction methods require large amounts of task-specific training data.
This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z)
- Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles [136.84278943588652]
We propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event.
To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm.
The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference.
arXiv Detail & Related papers (2023-09-17T20:28:17Z)
- Knowledge-Enhanced Multi-Label Few-Shot Product Attribute-Value Extraction [4.511923587827302]
Existing attribute-value extraction models require large quantities of labeled data for training.
New products with new attribute-value pairs enter the market every day in real-world e-Commerce.
We propose a Knowledge-Enhanced Attentive Framework (KEAF) based on prototypical networks to learn more discriminative prototypes.
arXiv Detail & Related papers (2023-08-16T14:58:12Z)
- Product Information Extraction using ChatGPT [69.12244027050454]
This paper explores the potential of ChatGPT for extracting attribute/value pairs from product descriptions.
Our results show that ChatGPT achieves a performance similar to a pre-trained language model but requires much smaller amounts of training data and computation for fine-tuning.
arXiv Detail & Related papers (2023-06-23T09:30:01Z)
- DocILE Benchmark for Document Information Localization and Extraction [7.944448547470927]
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition.
It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training.
arXiv Detail & Related papers (2023-02-11T11:32:10Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data [61.789797281676606]
We propose a novel meta-learning latent variable approach, called MetaBridge.
It can learn transferable knowledge from a subset of categories with limited labeled data.
It can capture the uncertainty of never-seen categories with unlabeled data.
arXiv Detail & Related papers (2020-06-15T21:31:05Z)