Can GPT-3 Perform Statutory Reasoning?
- URL: http://arxiv.org/abs/2302.06100v2
- Date: Wed, 10 May 2023 19:17:23 GMT
- Title: Can GPT-3 Perform Statutory Reasoning?
- Authors: Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme
- Abstract summary: We explore the capabilities of the most capable GPT-3 model, text-davinci-003, on an established statutory-reasoning dataset called SARA.
We find GPT-3 performs poorly at answering straightforward questions about simple synthetic statutes.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Statutory reasoning is the task of reasoning with facts and statutes, which are rules written in natural language by a legislature. It is a basic legal skill. In this paper we explore the capabilities of the most capable GPT-3 model, text-davinci-003, on an established statutory-reasoning dataset called SARA. We consider a variety of approaches, including dynamic few-shot prompting, chain-of-thought prompting, and zero-shot prompting. While we achieve results with GPT-3 that are better than the previous best published results, we also identify several types of clear errors it makes. We investigate why these errors happen. We discover that GPT-3 has imperfect prior knowledge of the actual U.S. statutes on which SARA is based. More importantly, we create simple synthetic statutes, which GPT-3 is guaranteed not to have seen during training. We find GPT-3 performs poorly at answering straightforward questions about these simple synthetic statutes.
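The synthetic-statute probe is easy to reproduce in spirit. Below is a minimal sketch of such a query, assuming the legacy openai Python client (pre-1.0) that exposed text-davinci-003 through the Completions endpoint; the statute, facts, and question are hypothetical illustrations, not items from SARA or from the paper.

```python
# Minimal sketch: asking a completion model a straightforward question
# about a simple synthetic statute, in the spirit of the paper's probe.
# Assumes the legacy openai client (<1.0); the statute, facts, and
# question below are hypothetical, not taken from SARA or the paper.
import openai

openai.api_key = "sk-..."  # your API key

SYNTHETIC_STATUTE = """\
Section 1001. (a) Any person who holds a permit under subsection (b)
owes a fee of $100 per year.
(b) A permit is issued to any person who files Form Q.
"""

FACTS = "Alice filed Form Q in 2021. Bob did not file Form Q."
QUESTION = "Does Bob owe a fee under section 1001? Answer Yes or No."

prompt = (
    f"Statute:\n{SYNTHETIC_STATUTE}\n"
    f"Facts: {FACTS}\n"
    f"Question: {QUESTION}\n"
    "Let's think step by step."  # chain-of-thought trigger
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=256,
    temperature=0,  # deterministic decoding for evaluation
)
print(response["choices"][0]["text"].strip())
```

Under the correct reading of this toy statute, only Alice holds a permit, so the expected answer for Bob is No; checking model answers against the statute's plain meaning is the kind of straightforward question the paper reports GPT-3 handling poorly.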
Related papers
- Large Language Models in Cryptocurrency Securities Cases: Can a GPT Model Meaningfully Assist Lawyers?
We study GPT-3.5's legal reasoning and ChatGPT's legal drafting capabilities.
First, we feed fact patterns from real-life cases to GPT-3.5 and evaluate its ability to identify the correct potential violations.
Second, we have mock jurors assess complaints written by ChatGPT and by lawyers.
arXiv Detail & Related papers (2023-08-11T09:23:11Z)
- How is ChatGPT's behavior changing over time?
We evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4.
We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
arXiv Detail & Related papers (2023-07-18T06:56:08Z)
- Reliability Check: An Analysis of GPT-3's Response to Sensitive Topics and Prompt Wording
We analyze what confuses GPT-3: how the model responds to certain sensitive topics, and what effect the prompt wording has on its responses.
We find that GPT-3 correctly disagrees with obvious Conspiracies and Stereotypes but makes mistakes with common Misconceptions and Controversies.
The model responses are inconsistent across prompts and settings, highlighting GPT-3's unreliability.
arXiv Detail & Related papers (2023-06-09T19:07:31Z)
- GPT-4: A Review on Advancements and Opportunities in Natural Language Processing
Generative Pre-trained Transformer 4 (GPT-4) is the fourth-generation language model in the GPT series, developed by OpenAI.
Compared with GPT-3, GPT-4 has a larger model size (more than one trillion parameters), better multilingual capabilities, improved contextual understanding, and stronger reasoning capabilities.
Potential applications of GPT-4 include chatbots, personal assistants, language translation, text summarization, and question answering.
arXiv Detail & Related papers (2023-05-04T22:46:43Z)
- Systematicity in GPT-3's Interpretation of Novel English Noun Compounds
We compare Levin et al.'s experimental data with GPT-3 generations, finding a high degree of similarity.
We fail to find convincing evidence that GPT-3 is reasoning about more than just individual lexical items.
These results highlight the importance of controlling for low-level distributional regularities when assessing whether a large language model latently encodes a deeper theory.
arXiv Detail & Related papers (2022-10-18T00:25:24Z)
- Prompting GPT-3 To Be Reliable
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- Who is GPT-3? An Exploration of Personality, Values and Demographics
Language models such as GPT-3 have caused a furore in the research community.
This paper answers a related question: who is GPT-3?
arXiv Detail & Related papers (2022-09-28T18:07:02Z)
- News Summarization and Evaluation in the Era of GPT-3
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- The Unreliability of Explanations in Few-Shot In-Context Learning
We focus on two NLP tasks that involve reasoning over text, namely question answering and natural language inference.
We show that explanations judged as good by humans (those that are logically consistent with the input) usually indicate more accurate predictions.
We present a framework for calibrating model predictions based on the reliability of the explanations.
arXiv Detail & Related papers (2022-05-06T17:57:58Z)
- Memory-assisted prompt editing to improve GPT-3 after deployment
We show how a (simulated) user can interactively teach a deployed GPT-3, doubling its accuracy on basic lexical tasks.
Our simple idea is a first step towards strengthening deployed models, potentially broadening their utility; a schematic sketch of the idea appears after this list.
arXiv Detail & Related papers (2022-01-16T10:11:37Z)
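The memory-assisted prompt-editing entry above describes a loop worth making concrete. The sketch below stores user clarifications and prepends the most similar one to later prompts; the retrieval rule, similarity threshold, and prompt format are illustrative assumptions, not the paper's actual method.

```python
# Schematic sketch of memory-assisted prompt editing: store user
# corrections keyed by the question that triggered them, and prepend
# the best-matching clarification to future prompts. Illustrates the
# general idea only; not the cited paper's exact retrieval or format.
from difflib import SequenceMatcher

memory: list[tuple[str, str]] = []  # (question, user clarification)

def remember(question: str, clarification: str) -> None:
    """Record a user's clarification of a misunderstood question."""
    memory.append((question, clarification))

def edit_prompt(question: str, threshold: float = 0.6) -> str:
    """Prepend the most similar stored clarification, if close enough."""
    best = max(
        memory,
        key=lambda m: SequenceMatcher(None, m[0], question).ratio(),
        default=None,
    )
    if best and SequenceMatcher(None, best[0], question).ratio() >= threshold:
        return f"Clarification: {best[1]}\nQuestion: {question}"
    return f"Question: {question}"

remember("What is the antonym of 'ascend'?",
         "By antonym I mean a word with the opposite meaning.")
print(edit_prompt("What is the antonym of 'brave'?"))
```

The edited prompt would then be sent to the deployed model in place of the raw question, so corrections accumulate without retraining.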