Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and
In-depth Evaluation
- URL: http://arxiv.org/abs/2310.16809v2
- Date: Sun, 29 Oct 2023 10:59:21 GMT
- Title: Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and
In-depth Evaluation
- Authors: Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen,
Chongyu Liu, Yuyi Zhang, Lianwen Jin
- Abstract summary: The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin content, but struggles with multilingual scenarios and complex tasks.
In general, despite its versatility in handling diverse OCR tasks, GPT-4V does not outperform existing state-of-the-art OCR models.
- Score: 33.66939971907121
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a comprehensive evaluation of the Optical Character
Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large
Multimodal Model (LMM). We assess the model's performance across a range of OCR
tasks, including scene text recognition, handwritten text recognition,
handwritten mathematical expression recognition, table structure recognition,
and information extraction from visually-rich documents. The evaluation reveals
that GPT-4V performs well in recognizing and understanding Latin content, but
struggles with multilingual scenarios and complex tasks. Specifically, it
showed limitations when dealing with non-Latin languages and complex tasks such
as handwritten mathematical expression recognition, table structure
recognition, and end-to-end semantic entity recognition and pair extraction
from document images. Based on these observations, we affirm the necessity and
continued research value of specialized OCR models. In general, despite its
versatility in handling diverse OCR tasks, GPT-4V does not outperform existing
state-of-the-art OCR models. How to fully utilize pre-trained general-purpose
LMMs such as GPT-4V for OCR downstream tasks remains an open problem. The study
offers a critical reference for future research in OCR with LMMs. Evaluation
pipeline and results are available at
https://github.com/SCUT-DLVCLab/GPT-4V_OCR.
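As a rough illustration of how such an evaluation queries the model, below is a minimal sketch using the OpenAI Python SDK. This is not the authors' pipeline (which is linked above); the model identifier, prompt wording, and input filename are assumptions for illustration only.

```python
import base64
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY in the environment

client = OpenAI()

def gpt4v_ocr(image_path: str,
              prompt: str = "Transcribe all text visible in this image.") -> str:
    """Send a single image to GPT-4V and return its raw transcription."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4V model identifier at the time of the paper
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content

# Hypothetical usage: the returned string would then be scored against
# ground truth with a metric such as character error rate.
print(gpt4v_ocr("scene_text_sample.png"))
```

A task-specific prompt (e.g. requesting LaTeX output for mathematical expressions, or HTML for table structure) would replace the default prompt above.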
Related papers
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously degrade the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on a pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample-efficient to train, as the model only needs to learn characters' visual appearance, not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, on various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- TransDocs: Optical Character Recognition with word to word translation [2.2336243882030025]
This research focuses on improving optical character recognition (OCR) with ML techniques.
The work is based on the ANKI dataset for English-to-Spanish translation.
arXiv Detail & Related papers (2023-04-15T21:40:14Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel visual document understanding (VDU) model that is end-to-end trainable without an underlying OCR framework.
Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z)
- MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding [70.16678926775475]
MMOCR is an open-source toolbox for text detection and recognition (a minimal usage sketch follows this list).
It implements 14 state-of-the-art algorithms, more than any other open-source OCR project we are aware of to date.
arXiv Detail & Related papers (2021-08-14T14:10:23Z)
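For the MMOCR entry above, here is a minimal usage sketch, assuming MMOCR's 1.x MMOCRInferencer API; the detector/recognizer choices and the input filename are illustrative, not prescribed by the paper.

```python
# pip install mmocr  (plus its mmengine/mmcv dependencies)
from mmocr.apis import MMOCRInferencer

# Pair one of the toolbox's text detectors with one of its recognizers.
ocr = MMOCRInferencer(det='DBNet', rec='CRNN')

# Run end-to-end text detection + recognition on a single image.
result = ocr('demo_text.jpg')  # hypothetical input image
print(result['predictions'])   # detected polygons and recognized strings
```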
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.