Can GPT-4 Support Analysis of Textual Data in Tasks Requiring Highly Specialized Domain Expertise?
- URL: http://arxiv.org/abs/2306.13906v1
- Date: Sat, 24 Jun 2023 08:48:24 GMT
- Title: Can GPT-4 Support Analysis of Textual Data in Tasks Requiring Highly Specialized Domain Expertise?
- Authors: Jaromir Savelka, Kevin D. Ashley, Morgan A Gray, Hannes Westermann, Huihui Xu
- Abstract summary: GPT-4, prompted with annotation guidelines, performs on par with well-trained law student annotators.
We demonstrated how to analyze GPT-4's predictions to identify and mitigate deficiencies in annotation guidelines.
- Score: 0.8924669503280334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We evaluated the capability of generative pre-trained transformers (GPT-4) in
the analysis of textual data in tasks that require highly specialized domain
expertise. Specifically, we focused on the task of analyzing court opinions to
interpret legal concepts. We found that GPT-4, prompted with annotation
guidelines, performs on par with well-trained law student annotators. We
observed that, with a relatively minor decrease in performance, GPT-4 can
perform batch predictions leading to significant cost reductions. However,
employing chain-of-thought prompting did not lead to noticeably improved
performance on this task. Further, we demonstrated how to analyze GPT-4's
predictions to identify and mitigate deficiencies in annotation guidelines, and
subsequently improve the performance of the model. Finally, we observed that
the model is quite brittle, as small formatting-related changes in the prompt
had a large impact on the predictions. These findings can be leveraged by
researchers and practitioners who engage in semantic/pragmatic annotations of
texts in tasks requiring highly specialized domain expertise.
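To make the setup concrete, here is a minimal sketch, assuming the OpenAI Python SDK, of the guideline-prompted batch annotation the abstract describes. The guideline text, label set, model name, and example sentences are illustrative placeholders, not the authors' actual materials.

```python
# A minimal sketch of guideline-prompted batch annotation with the OpenAI
# Python SDK. Guidelines, labels, model name, and sentences are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

GUIDELINES = (
    "You annotate sentences from court opinions that apply a statutory term.\n"
    "Assign each sentence exactly one label: USEFUL or NOT_USEFUL.\n"
    "(Replace with the full annotation guidelines.)"
)

def annotate_batch(sentences: list[str]) -> str:
    """Label many sentences in a single call; batching trades a small
    accuracy drop for a large per-sentence cost reduction, per the abstract."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, 1))
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep annotation output as deterministic as possible
        messages=[
            {"role": "system", "content": GUIDELINES},
            {"role": "user",
             "content": f"Label each sentence as '<number>: <label>'.\n{numbered}"},
        ],
    )
    return response.choices[0].message.content

print(annotate_batch([
    "The court held that 'vehicle' plainly includes motorized scooters.",
    "The parties stipulated to the facts below.",
]))
```

Because the paper found predictions to be sensitive to small formatting changes, a prompt layout like the numbered list above should be held fixed once validated.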
Related papers
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
- Can GPT-4 learn to analyse moves in research article abstracts? [0.9999629695552195]
We employ the affordances of GPT-4 to automate the annotation process by using natural language prompts.
An 8-shot prompt was more effective than a 2-shot prompt, confirming that including examples that illustrate areas of variability enhances GPT-4's ability to recognize multiple moves in a single sentence (a minimal few-shot sketch appears after this list).
arXiv Detail & Related papers (2024-07-22T13:14:27Z)
- Identifying and Improving Disability Bias in GPT-Based Resume Screening [9.881826151448198]
We ask ChatGPT to rank a resume against the same resume enhanced with an additional leadership award, scholarship, panel presentation, and membership, all of which are disability-related.
We find that GPT-4 exhibits prejudice towards these enhanced CVs.
We show that this prejudice can be quantifiably reduced by training a custom GPT on principles of DEI and disability justice.
arXiv Detail & Related papers (2024-01-28T17:04:59Z)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
- Black-Box Analysis: GPTs Across Time in Legal Textual Entailment Task [17.25356594832692]
We present an analysis of the performance of GPT-3.5 (ChatGPT) and GPT-4 on the COLIEE Task 4 dataset.
Our preliminary experimental results unveil intriguing insights into the models' strengths and weaknesses in handling legal textual entailment tasks.
arXiv Detail & Related papers (2023-09-11T14:43:54Z)
- Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues [2.3361634876233817]
Large language models, such as the AI chatbot ChatGPT, hold potential for offering constructive feedback to tutors in practical settings.
The accuracy of AI-generated feedback remains uncertain, with scant research investigating the ability of models like ChatGPT to deliver effective feedback.
arXiv Detail & Related papers (2023-07-05T04:14:01Z)
- Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z)
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
- The Unreliability of Explanations in Few-Shot In-Context Learning [50.77996380021221]
We focus on two NLP tasks that involve reasoning over text, namely question answering and natural language inference.
We show that explanations judged as good by humans (those that are logically consistent with the input) usually indicate more accurate predictions.
We present a framework for calibrating model predictions based on the reliability of the explanations.
arXiv Detail & Related papers (2022-05-06T17:57:58Z)
- Probing as Quantifying the Inductive Bias of Pre-trained Representations [99.93552997506438]
We present a novel framework for probing where the goal is to evaluate the inductive bias of representations for a particular task.
We apply our framework to a series of token-, arc-, and sentence-level tasks.
arXiv Detail & Related papers (2021-10-15T22:01:16Z)
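As referenced in the "Can GPT-4 learn to analyse moves" entry above, here is a minimal few-shot prompting sketch. The move labels, demonstration sentences, and model name are hypothetical stand-ins for that paper's actual 8-shot prompt and move taxonomy.

```python
# A minimal n-shot prompting sketch (hypothetical labels and examples;
# the cited paper's actual 8-shot prompt and move taxonomy differ).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Labelled demonstrations prepended to the prompt; add more pairs to
# move from 2-shot toward the 8-shot setup described above.
SHOTS = [
    ("Little is known about X in low-resource settings.", "GAP"),
    ("We collected 500 abstracts from three journals.", "METHOD"),
]

def classify_move(sentence: str) -> str:
    demos = "\n\n".join(f"Sentence: {s}\nMove: {m}" for s, m in SHOTS)
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"{demos}\n\nSentence: {sentence}\nMove:",
        }],
    )
    return response.choices[0].message.content.strip()

print(classify_move("Our results suggest a new direction for curriculum design."))
```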
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.