Notes on Applicability of GPT-4 to Document Understanding
- URL: http://arxiv.org/abs/2405.18433v1
- Date: Tue, 28 May 2024 17:59:53 GMT
- Title: Notes on Applicability of GPT-4 to Document Understanding
- Authors: Ćukasz Borchmann,
- Abstract summary: We evaluate all publicly available GPT-4 family models concerning the Document Understanding field.
Benchmark results indicate that though it is hard to achieve satisfactory results with text-only models, GPT-4 Vision Turbo performs well when one provides both text recognized by an external OCR engine and document images on the input.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We perform a missing, reproducible evaluation of all publicly available GPT-4 family models concerning the Document Understanding field, where it is frequently required to comprehend text spacial arrangement and visual clues in addition to textual semantics. Benchmark results indicate that though it is hard to achieve satisfactory results with text-only models, GPT-4 Vision Turbo performs well when one provides both text recognized by an external OCR engine and document images on the input. Evaluation is followed by analyses that suggest possible contamination of textual GPT-4 models and indicate the significant performance drop for lengthy documents.
Related papers
- ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z) - GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z) - Towards Improving Document Understanding: An Exploration on
Text-Grounding via MLLMs [96.54224331778195]
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating the effectiveness of our method.
arXiv Detail & Related papers (2023-11-22T06:46:37Z) - Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks [53.936643052339]
We evaluate the reasoning abilities of text-only and multimodal versions of GPT-4.
Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
arXiv Detail & Related papers (2023-11-14T04:33:49Z) - GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z) - Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and
In-depth Evaluation [33.66939971907121]
The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin contents, but struggles with multilingual scenarios and complex tasks.
In general, despite its versatility in handling diverse OCR tasks, GPT-4V does not outperform existing state-of-the-art OCR models.
arXiv Detail & Related papers (2023-10-25T17:38:55Z) - An Early Evaluation of GPT-4V(ision) [40.866323649060696]
We evaluate different abilities of GPT-4V including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio.
To estimate GPT-4V's performance, we manually construct 656 test instances and carefully evaluate the results of GPT-4V.
arXiv Detail & Related papers (2023-10-25T10:33:17Z) - TRIE++: Towards End-to-End Information Extraction from Visually Rich
Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained in an end-to-end trainable manner, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.