GPT Takes the Bar Exam
- URL: http://arxiv.org/abs/2212.14402v1
- Date: Thu, 29 Dec 2022 18:19:43 GMT
- Title: GPT Takes the Bar Exam
- Authors: Michael Bommarito II, Daniel Martin Katz
- Abstract summary: We document our experimental evaluation of the performance of OpenAI's text-davinci-003 model, often referred to as GPT-3.5.
For best prompt and parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete NCBE MBE practice exam.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nearly all jurisdictions in the United States require a professional license
exam, commonly referred to as "the Bar Exam," as a precondition for law
practice. To even sit for the exam, most jurisdictions require that an
applicant complete at least seven years of post-secondary education, including
three years at an accredited law school. In addition, most test-takers also
undergo weeks to months of further, exam-specific preparation. Despite this
significant investment of time and capital, approximately one in five
test-takers still score under the rate required to pass the exam on their first
try. In the face of a complex task that requires such depth of knowledge, what,
then, should we expect of the state of the art in "AI?" In this research, we
document our experimental evaluation of the performance of OpenAI's
`text-davinci-003` model, often referred to as GPT-3.5, on the multistate
multiple choice (MBE) section of the exam. While we find no benefit in
fine-tuning over GPT-3.5's zero-shot performance at the scale of our training
data, we do find that hyperparameter optimization and prompt engineering
positively impacted GPT-3.5's zero-shot performance. For best prompt and
parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete
NCBE MBE practice exam, significantly in excess of the 25% baseline guessing
rate, and performs at a passing rate for both Evidence and Torts. GPT-3.5's
ranking of responses is also highly-correlated with correctness; its top two
and top three choices are correct 71% and 88% of the time, respectively,
indicating very strong non-entailment performance. While our ability to
interpret these results is limited by nascent scientific understanding of LLMs
and the proprietary nature of GPT, we believe that these results strongly
suggest that an LLM will pass the MBE component of the Bar Exam in the near
future.
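As a rough illustration of the kind of evaluation harness the abstract describes, the sketch below prompts `text-davinci-003` on an MBE-style multiple-choice question and scores top-k accuracy over a practice exam. It uses the legacy `openai` Python SDK (pre-1.0) Completion endpoint; the prompt wording, the temperature, and the `ask_mbe_question` / `top_k_accuracy` helpers are illustrative assumptions, not the authors' published prompts or parameters.

```python
# Minimal sketch of a zero-shot MBE-style evaluation loop.
# Assumes the legacy openai Python SDK (< 1.0), which exposed text-davinci-003
# via the Completion endpoint; prompt text and settings are illustrative only.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; a real key would come from env/config


def ask_mbe_question(stem: str, choices: dict[str, str]) -> list[str]:
    """Ask one multiple-choice question and return answer letters ranked best-first."""
    options = "\n".join(f"({letter}) {text}" for letter, text in choices.items())
    prompt = (
        "Answer the following bar exam question. Rank the answer choices from "
        "best to worst as a comma-separated list of letters, best first.\n\n"
        f"Question: {stem}\n{options}\n\nRanking:"
    )
    response = openai.Completion.create(   # legacy (<1.0) Completion API
        model="text-davinci-003",
        prompt=prompt,
        temperature=0.0,   # the paper tunes such parameters; 0.0 is just a starting point
        max_tokens=16,
    )
    raw = response["choices"][0]["text"]
    ranked = [tok.strip().strip("().") for tok in raw.split(",")]
    # Keep only tokens that correspond to actual answer letters.
    return [letter for letter in ranked if letter in choices]


def top_k_accuracy(rankings: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of questions whose correct letter appears among the first k ranked choices."""
    hits = sum(answer in ranked[:k] for ranked, answer in zip(rankings, gold))
    return hits / len(gold)
```

Under these assumptions, the headline 50.3% figure would correspond to `top_k_accuracy(..., k=1)`, while the reported 71% and 88% figures correspond to k=2 and k=3 over the model's ranked choices.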
Related papers
- CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models (arXiv 2024-07-02)
  CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) in a Chinese context. It comprises 99,100 questions spanning 43 second-level categories with three question types: single-choice, multiple-choice, and judgment. The results show that GPT-4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
- GPT-4 passes most of the 297 written Polish Board Certification Examinations (arXiv 2024-04-29)
  This study evaluated the performance of three Generative Pretrained Transformer (GPT) models on the Polish Board Certification Exam (Państwowy Egzamin Specjalizacyjny, PES) dataset. The GPT models varied significantly, displaying excellence in exams related to certain specialties while completely failing others.
- Evaluating the Performance of Large Language Models for Spanish Language in Undergraduate Admissions Exams (arXiv 2023-12-28)
  This study evaluates the performance of large language models, specifically GPT-3.5 and BARD, on undergraduate admissions exams set by the National Polytechnic Institute in Mexico. Both models demonstrated proficiency, exceeding the minimum acceptance scores for the respective academic programs by up to 75% for some programs.
- How is ChatGPT's behavior changing over time? (arXiv 2023-07-18)
  We evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
- Professional Certification Benchmark Dataset: The First 500 Jobs For Large Language Models (arXiv 2023-05-07)
  The research creates a professional certification survey to test large language models and evaluate their employable skills. It compares the performance of two AI models, GPT-3 and Turbo-GPT3.5, on a benchmark dataset of 1149 professional certifications.
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models (arXiv 2023-04-13)
  We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003. GPT-4 surpasses average human performance on the SAT, the LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and 92.5% accuracy on the English test of the Chinese national college entrance exam.
- Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams (arXiv 2023-03-29)
  This study aims to explore the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests. The exam poses challenging tasks for LMs, since its questions may span multiple fields of knowledge. The best-performing model, GPT-4 with Chain-of-Thought prompts, achieved an accuracy of 87%, largely surpassing GPT-3.5 by 11 points.
- GPT-4 Technical Report (arXiv 2023-03-15)
  GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs. It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
- How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks (arXiv 2023-03-01)
  GPT-3.5 models have demonstrated impressive performance on various Natural Language Processing (NLP) tasks. However, their robustness and ability to handle the varied complexities of the open world have yet to be explored. We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
- GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities (arXiv 2023-01-11)
  We experimentally evaluate OpenAI's text-davinci-003 and prior versions of GPT on a sample Regulation (REG) exam. We find that text-davinci-003 achieves a correct rate of 14.4% on a sample REG exam section, significantly underperforming human capabilities on quantitative reasoning in zero-shot prompts. For best prompt and parameters, the model answers 57.6% of questions correctly, significantly better than the 25% guessing rate, and its top two answers are correct 82.1% of the time, indicating strong non-entailment.
- Prompting GPT-3 To Be Reliable (arXiv 2022-10-17)
  This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We find that GPT-3 outperforms smaller-scale supervised models by large margins on all of these facets.
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.