A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on
Reasoning, Hallucination, and Interactivity
- URL: http://arxiv.org/abs/2302.04023v4
- Date: Tue, 28 Nov 2023 09:01:12 GMT
- Title: A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on
Reasoning, Hallucination, and Interactivity
- Authors: Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su,
Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do,
Yan Xu, Pascale Fung
- Abstract summary: We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
- Score: 79.12003701981092
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper proposes a framework for quantitatively evaluating interactive
LLMs such as ChatGPT using publicly available data sets. We carry out an
extensive technical evaluation of ChatGPT using 23 data sets covering 8
different common NLP application tasks. We evaluate the multitask, multilingual
and multi-modal aspects of ChatGPT based on these data sets and a newly
designed multimodal dataset. We find that ChatGPT outperforms LLMs with
zero-shot learning on most tasks and even outperforms fine-tuned models on some
tasks. We find that it is better at understanding non-Latin script languages
than generating them. It is able to generate multimodal content from textual
prompts, via an intermediate code generation step. Moreover, we find that
ChatGPT is 63.41% accurate on average in 10 different reasoning categories
under logical reasoning, non-textual reasoning, and commonsense reasoning,
hence making it an unreliable reasoner. It is, for example, better at deductive
than inductive reasoning. Like other LLMs, ChatGPT suffers from hallucination
problems, and it generates more extrinsic hallucinations from its parametric
memory since it does not have access to an external knowledge base. Finally, the
interactive feature of ChatGPT enables human collaboration with the underlying
LLM to improve its performance, i.e., an 8% ROUGE-1 gain on summarization and a
2% ChrF++ gain on machine translation, in a multi-turn "prompt engineering"
fashion. We also release the codebase for evaluation set extraction.
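As an illustration of the intermediate code generation step mentioned above, here is a minimal sketch of the text -> code -> image path, written against the OpenAI Python client. The model name, prompt wording, and the choice of SVG as the intermediate representation are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch: instead of emitting pixels, the model is asked for drawing
# code (SVG here), which an external renderer then turns into an image.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

description = "a red circle above a blue square on a white background"
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative stand-in for the ChatGPT version evaluated
    messages=[{
        "role": "user",
        "content": f"Write a complete SVG document that draws {description}. "
                   "Return only the SVG markup.",
    }],
)

svg_code = response.choices[0].message.content
with open("generated.svg", "w") as f:
    f.write(svg_code)  # any SVG renderer or browser displays the result
```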
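The multi-turn "prompt engineering" collaboration can be sketched the same way: a human reads the first draft, sends corrective feedback as an extra conversation turn, and the revised output is what gets scored (the reported gains are 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation). The prompts and helper function below are illustrative assumptions, not the paper's released code.

```python
# Hedged sketch of one human-in-the-loop refinement round.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"  # illustrative stand-in

def summarize_with_feedback(document: str, feedback: str) -> str:
    # Turn 1: ask for an initial summary.
    messages = [{"role": "user", "content": f"Summarize this article:\n{document}"}]
    draft = client.chat.completions.create(
        model=MODEL, messages=messages
    ).choices[0].message.content

    # Turn 2: keep the draft in context and ask for a revision based on the
    # human's feedback; the revised output is what gets evaluated.
    messages += [
        {"role": "assistant", "content": draft},
        {"role": "user", "content": f"Revise the summary. Feedback: {feedback}"},
    ]
    return client.chat.completions.create(
        model=MODEL, messages=messages
    ).choices[0].message.content

# e.g. summarize_with_feedback(article, "Name the dataset and keep it to two sentences.")
```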
Related papers
- How Good is ChatGPT at Face Biometrics? A First Look into Recognition,
Soft Biometrics, and Explainability [17.85111188884935]
ChatGPT allows anyone to interact in a simple conversational way with large language models.
We analyze the ability of ChatGPT to perform tasks such as face verification, soft-biometrics estimation, and explainability of the results.
arXiv Detail & Related papers (2024-01-24T18:10:39Z)
- Pushing the Limits of ChatGPT on NLP Tasks [79.17291002710517]
Despite the success of ChatGPT, its performance on most NLP tasks is still well below the supervised baselines.
In this work, we look into the causes and find that its subpar performance stems from several identifiable factors.
We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks.
arXiv Detail & Related papers (2023-06-16T09:40:05Z)
- A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets [19.521390684403293]
We present a thorough evaluation of ChatGPT's performance on diverse academic datasets.
Specifically, we evaluate ChatGPT across 140 tasks and analyze the 255K responses it generates on these datasets.
arXiv Detail & Related papers (2023-05-29T12:37:21Z)
- ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models [125.7209927536255]
We propose ChatCoT, a tool-augmented chain-of-thought reasoning framework for chat-based LLMs.
In ChatCoT, we model chain-of-thought (CoT) reasoning as a multi-turn conversation, so that tools can be used in a more natural way through chatting (see the sketch after this list).
Our approach can effectively leverage the multi-turn conversation ability of chat-based LLMs and integrate chain-of-thought following and tool manipulation in a unified way.
arXiv Detail & Related papers (2023-05-23T17:54:33Z)
- Automatic Code Summarization via ChatGPT: How Far Are We? [10.692654700225411]
We evaluate ChatGPT on a widely-used Python dataset called CSN-Python.
In terms of BLEU and ROUGE-L, ChatGPT's code summarization performance is significantly worse than that of all three SOTA models.
Based on the findings, we outline several open challenges and opportunities in ChatGPT-based code summarization.
arXiv Detail & Related papers (2023-05-22T09:43:40Z)
- ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning [70.57126720079971]
Large language models (LLMs) have emerged as one of the most important breakthroughs in natural language processing (NLP).
This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources.
Compared to previous models, our extensive experimental results demonstrate worse performance from ChatGPT across different NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12T05:08:52Z)
- When do you need Chain-of-Thought Prompting for ChatGPT? [87.45382888430643]
Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from large language models (LLMs).
It is not clear whether CoT is still effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT.
arXiv Detail & Related papers (2023-04-06T17:47:29Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
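As a companion to the ChatCoT entry above, the following is a minimal sketch of tool-augmented chain-of-thought as a multi-turn chat: the model may request a tool mid-reasoning, and the tool's output is fed back as the next turn. The TOOL: reply convention and the toy calculator are illustrative assumptions, not ChatCoT's actual protocol.

```python
# Hedged sketch: CoT reasoning as a chat loop with a tool-call convention.
from openai import OpenAI

client = OpenAI()

def run_tool(request: str) -> str:
    # Toy calculator; a real system would dispatch to retrievers, solvers, etc.
    expr = request.removeprefix("TOOL: calculator ")
    return str(eval(expr, {"__builtins__": {}}, {}))  # toy-only: eval is unsafe in general

messages = [{
    "role": "user",
    "content": "Reason step by step. When you need arithmetic, reply with exactly "
               "'TOOL: calculator <expression>' and wait for the result. "
               "Question: what is 17 * 23 + 5?",
}]
for _ in range(5):  # bound the number of reasoning turns
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    if reply.startswith("TOOL:"):
        messages.append({"role": "user", "content": run_tool(reply)})
    else:
        break  # no tool request means the chain ended with a final answer
print(reply)
```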