GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting
- URL: http://arxiv.org/abs/2510.06782v1
- Date: Wed, 08 Oct 2025 09:09:29 GMT
- Title: GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting
- Authors: Kaichun Yang, Jian Chen
- Abstract summary: We present a quantitative evaluation to understand the effect of zero-shot large language models (LLMs) on chart reading tasks. We asked LLMs to answer 107 visualization questions to compare inference accuracies between the agentic GPT-5 and the multimodal GPT-4V. Our results show that model architecture dominates inference accuracy, while prompt variants yielded only small effects.
- Score: 3.765281403026053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a quantitative evaluation to understand the effects of zero-shot large language models (LLMs) and prompt use on chart reading tasks. We asked LLMs to answer 107 visualization questions to compare inference accuracies between the agentic GPT-5 and the multimodal GPT-4V on difficult image instances where GPT-4V failed to produce correct answers. Our results show that model architecture dominates inference accuracy: GPT-5 largely improved accuracy, while prompt variants yielded only small effects. Pre-registration of this work is available here: https://osf.io/u78td/?view_only=6b075584311f48e991c39335c840ded3; the Google Drive materials are here: https://drive.google.com/file/d/1ll8WWZDf7cCNcfNWrLViWt8GwDNSvVrp/view.
Related papers
- OpenAI GPT-5 System Card [247.27796140570612]
GPT-5 is a unified system with a smart and fast model that answers most questions. A real-time router decides which model to use based on conversation type, complexity, tool needs, and explicit intent. Once usage limits are reached, a mini version of each model handles remaining queries.
arXiv Detail & Related papers (2025-12-19T07:05:38Z) - Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning [6.537257913467247]
This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD). Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning. Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios.
arXiv Detail & Related papers (2025-01-15T13:46:33Z) - GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z) - NERIF: GPT-4V for Automatic Scoring of Drawn Models [0.6278186810520364]
Recently released GPT-4V provides a unique opportunity to advance scientific modeling practices.
We developed a method employing instructional note and rubrics to prompt GPT-4V to score students' drawn models.
GPT-4V scores were compared with human experts' scores to calculate scoring accuracy.
arXiv Detail & Related papers (2023-11-21T20:52:04Z) - An Early Evaluation of GPT-4V(ision) [40.866323649060696]
We evaluate different abilities of GPT-4V including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio.
To estimate GPT-4V's performance, we manually construct 656 test instances and carefully evaluate the results of GPT-4V.
arXiv Detail & Related papers (2023-10-25T10:33:17Z) - Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis [7.099257763803159]
We evaluate the capabilities of four Large Language Models (LLMs) in addressing several analytical problems with graph data.
We employ four distinct evaluation metrics: Comprehension, Correctness, Fidelity, and Rectification.
GPT models can generate logical and coherent results, outperforming alternatives in correctness.
arXiv Detail & Related papers (2023-08-22T06:32:07Z) - IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models [80.4460931172274]
We develop a framework that decomposes vision-and-language (VL) reasoning using large language models (LLMs). In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE.
arXiv Detail & Related papers (2023-05-24T10:19:57Z) - Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z)