GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities
- URL: http://arxiv.org/abs/2301.04408v1
- Date: Wed, 11 Jan 2023 11:30:42 GMT
- Title: GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities
- Authors: Jillian Bommarito, Michael Bommarito, Daniel Martin Katz, Jessica Katz
- Abstract summary: We experimentally evaluate OpenAI's text-davinci-003 and prior versions of GPT on a sample Regulation (REG) exam.
We find that text-davinci-003 achieves a correct rate of 14.4% on a sample REG exam section, significantly underperforming human capabilities on quantitative reasoning in zero-shot prompts.
For best prompt and parameters, the model answers 57.6% of questions correctly, significantly better than the 25% guessing rate, and its top two answers are correct 82.1% of the time, indicating strong non-entailment.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The global economy is increasingly dependent on knowledge workers to meet the
needs of public and private organizations. While there is no single definition
of knowledge work, organizations and industry groups still attempt to measure
individuals' capability to engage in it. The most comprehensive assessment of
capability readiness for professional knowledge workers is the Uniform CPA
Examination developed by the American Institute of Certified Public Accountants
(AICPA). In this paper, we experimentally evaluate OpenAI's `text-davinci-003`
and prior versions of GPT on both a sample Regulation (REG) exam and an
assessment of over 200 multiple-choice questions based on the AICPA Blueprints
for legal, financial, accounting, technology, and ethical tasks. First, we find
that `text-davinci-003` achieves a correct rate of 14.4% on a sample REG exam
section, significantly underperforming human capabilities on quantitative
reasoning in zero-shot prompts. Second, `text-davinci-003` appears to be
approaching human-level performance on the Remembering & Understanding and
Application skill levels in the Exam absent calculation. For best prompt and
parameters, the model answers 57.6% of questions correctly, significantly
better than the 25% guessing rate, and its top two answers are correct 82.1% of
the time, indicating strong non-entailment. Finally, we find that recent
generations of GPT-3 demonstrate material improvements on this assessment,
rising from 30% for `text-davinci-001` to 57% for `text-davinci-003`. These
findings strongly suggest that large language models have the potential to
transform the quality and efficiency of future knowledge work.
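As a rough illustration of the zero-shot multiple-choice protocol reported above (a sketch under assumed interfaces, not the authors' released code), the evaluation reduces to formatting each question as a plain prompt, asking the model to rank the option letters, and comparing top-1 and top-2 accuracy against the 25% random-guessing baseline; `rank_choices` below is a hypothetical wrapper around an LLM API that returns the option letters ordered by the model's preference (e.g. by log-probability).

```python
# Minimal sketch of a zero-shot multiple-choice evaluation (not the authors'
# released code). `rank_choices` stands in for an LLM call that returns the
# option letters ordered from most to least preferred.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Question:
    stem: str
    options: Dict[str, str]  # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str              # correct option letter

def zero_shot_prompt(q: Question) -> str:
    """Format one question as a plain zero-shot prompt, with no worked examples."""
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in q.options.items())
    return ("Answer the following multiple-choice question with a single letter.\n\n"
            f"{q.stem}\n{option_lines}\nAnswer:")

def evaluate(questions: List[Question],
             rank_choices: Callable[[str, List[str]], List[str]]) -> Dict[str, float]:
    """Top-1 and top-2 accuracy over a question set, versus the guessing rate."""
    top1 = top2 = 0
    for q in questions:
        ranked = rank_choices(zero_shot_prompt(q), list(q.options))
        top1 += ranked[0] == q.answer      # best answer is correct
        top2 += q.answer in ranked[:2]     # correct answer is in the top two
    n = len(questions)
    return {"top1_accuracy": top1 / n,     # the abstract reports 57.6% here
            "top2_accuracy": top2 / n,     # the abstract reports 82.1% here
            "guessing_baseline": 0.25}     # four options per question
```

With four options per question, top-1 accuracy materially above 25% (and top-2 above 50%) indicates performance beyond chance, which is the comparison the abstract reports.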
Related papers
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [175.9723801486487]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions.
GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions.
Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z) - On Evaluating Explanation Utility for Human-AI Decision Making in NLP [39.58317527488534]
We review existing metrics suitable for application-grounded evaluation.
We demonstrate the importance of reassessing the state of the art to form and study human-AI teams.
arXiv Detail & Related papers (2024-07-03T23:53:27Z) - Evaluating AI Vocational Skills Through Professional Testing [0.0]
The study focuses on assessing the vocational skills of two AI models, GPT-3 and Turbo-GPT3.5.
Both models scored well even on sensory and experience-based tests that fall outside a machine's traditional roles.
The study found that OpenAI's progression from the Babbage model to Turbo improved performance on the grading scale by 60% within a few years.
arXiv Detail & Related papers (2023-12-17T04:41:59Z) - FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long
Form Text Generation [176.56131810249602]
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial.
We introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source (a minimal sketch of this computation appears after this list).
arXiv Detail & Related papers (2023-05-23T17:06:00Z) - AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z) - How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language
Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and abilities to handle various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z) - GPT Takes the Bar Exam [0.0]
We document our experimental evaluation of the performance of OpenAI's text-davinci-003 model, often referred to as GPT-3.5.
For best prompt and parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete NCBE MBE practice exam.
arXiv Detail & Related papers (2022-12-29T18:19:43Z) - Predicting article quality scores with machine learning: The UK Research
Excellence Framework [6.582887504429817]
Accuracy is highest in the medical and physical sciences Units of Assessment (UoAs) and economics.
Prediction accuracies above the baseline for the social science, mathematics, engineering, arts, and humanities UoAs were much lower or close to zero.
We increased accuracy with an active learning strategy and by selecting articles with higher prediction probabilities, as estimated by the algorithms, but this substantially reduced the number of scores predicted.
arXiv Detail & Related papers (2022-12-11T05:45:12Z) - Using Sampling to Estimate and Improve Performance of Automated Scoring
Systems with Guarantees [63.62448343531963]
We propose a combination of the existing paradigms, in which the responses to be scored by humans are sampled intelligently.
We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget.
arXiv Detail & Related papers (2021-11-17T05:00:51Z) - COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs [82.8453695903687]
We show that manually constructed commonsense knowledge graphs (CSKGs) will never achieve the coverage necessary to be applicable in all situations encountered by NLP agents.
We propose ATOMIC 2020, a new CSKG of general-purpose commonsense knowledge containing knowledge that is not readily available in pretrained language models.
We evaluate its properties in comparison with other leading CSKGs, performing the first large-scale pairwise study of commonsense knowledge resources.
arXiv Detail & Related papers (2020-10-12T18:27:05Z)