Black-Box Analysis: GPTs Across Time in Legal Textual Entailment Task
- URL: http://arxiv.org/abs/2309.05501v1
- Date: Mon, 11 Sep 2023 14:43:54 GMT
- Title: Black-Box Analysis: GPTs Across Time in Legal Textual Entailment Task
- Authors: Ha-Thanh Nguyen, Randy Goebel, Francesca Toni, Kostas Stathis, Ken
Satoh
- Abstract summary: We present an analysis of the performance of GPT-3.5 (ChatGPT) and GPT-4 on the COLIEE Task 4 dataset.
Our preliminary experimental results unveil intriguing insights into the models' strengths and weaknesses in handling legal textual entailment tasks.
- Score: 17.25356594832692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The evolution of Generative Pre-trained Transformer (GPT) models has led to
significant advancements in various natural language processing applications,
particularly in legal textual entailment. We present an analysis of the
performance of GPT-3.5 (ChatGPT) and GPT-4 on the COLIEE Task 4 dataset, a prominent
benchmark in this domain. The study encompasses data from Heisei 18 (2006) to
Reiwa 3 (2021), exploring the models' abilities to discern entailment
relationships within Japanese statute law across different periods. Our
preliminary experimental results unveil intriguing insights into the models'
strengths and weaknesses in handling legal textual entailment tasks, as well as
the patterns observed in model performance. In the context of proprietary
models with undisclosed architectures and weights, black-box analysis becomes
crucial for evaluating their capabilities. We discuss the influence of training
data distribution and the implications on the models' generalizability. This
analysis serves as a foundation for future research, aiming to optimize
GPT-based models and enable their successful adoption in legal information
extraction and entailment applications.
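As a rough illustration of the kind of black-box evaluation described above, the sketch below queries a chat model with a statute article and a legal statement and asks for a yes/no entailment judgment. It is a minimal sketch assuming the OpenAI Python client (openai >= 1.0), an illustrative prompt wording, and a hypothetical JSON file per exam year; the actual prompts, model snapshots, and answer parsing used in the study are not reproduced here, and the COLIEE Task 4 data must be obtained separately.

```python
# Hypothetical sketch of a black-box entailment probe via the OpenAI chat API.
# Prompt wording, answer parsing, and file layout are illustrative assumptions,
# not the exact procedure used in the paper.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are given an article from Japanese statute law and a legal statement.\n"
    "Answer with a single word: YES if the article entails the statement, NO otherwise.\n\n"
    "Article:\n{article}\n\nStatement:\n{query}\nAnswer:"
)

def predict_entailment(article: str, query: str, model: str = "gpt-4") -> str:
    """Query the model once and map its free-text reply to a Y/N label."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": PROMPT.format(article=article, query=query)}],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "Y" if answer.startswith("YES") else "N"

def evaluate(examples: list[dict], model: str = "gpt-4") -> float:
    """Accuracy over examples shaped like {'article': ..., 'query': ..., 'label': 'Y'/'N'}."""
    correct = sum(
        predict_entailment(ex["article"], ex["query"], model) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)

if __name__ == "__main__":
    # e.g. one hypothetical JSON file per exam year, to study performance over time
    with open("coliee_task4_R03.json", encoding="utf-8") as f:
        data = json.load(f)
    print(f"Accuracy: {evaluate(data):.3f}")
```

Running the same loop separately on each exam year (Heisei 18 through Reiwa 3) would yield the per-period accuracies needed to examine how performance patterns vary across time.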
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - On Training Data Influence of GPT Models [37.53037752668756]
GPTfluence is a novel approach to assess the impact of training examples on the training dynamics of GPT models.
Our approach traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points.
arXiv Detail & Related papers (2024-04-11T15:27:56Z) - Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection [10.301985230669684]
This paper presents a comprehensive analysis of GPT-4, GPT-3.5 Turbo, and FLAN-T5 models in detecting framing in news headlines.
We evaluated these models in various scenarios: zero-shot, few-shot with in-domain examples, cross-domain examples, and settings where models explain their predictions.
arXiv Detail & Related papers (2024-02-18T15:27:48Z) - Gemini vs GPT-4V: A Preliminary Comparison and Combination of
Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z) - SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its
Departure from Current Machine Learning [5.177947445379688]
This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis.
Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification.
The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations.
arXiv Detail & Related papers (2023-07-16T05:33:35Z) - Sensitivity and Robustness of Large Language Models to Prompt Template
in Japanese Text Classification Tasks [0.0]
A critical issue has been identified within this domain: the inadequate sensitivity and robustness of large language models to prompt templates.
This paper explores this issue through a comprehensive evaluation of several representative Large Language Models (LLMs) and a widely used pre-trained language model (PLM).
Our experimental results reveal startling discrepancies: a simple modification in the sentence structure of the prompt template led to a drastic drop in the accuracy of GPT-4 from 49.21 to 25.44.
arXiv Detail & Related papers (2023-05-15T15:19:08Z) - Exploring the Trade-Offs: Unified Large Language Models vs Local
Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z) - A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models [71.42197262495056]
GPT series models have gained considerable attention due to their exceptional natural language processing capabilities.
We select six representative models, comprising two GPT-3 series models and four GPT-3.5 series models.
We evaluate their performance on nine natural language understanding (NLU) tasks using 21 datasets.
Our experiments reveal that the overall ability of GPT series models on NLU tasks does not increase gradually as the models evolve.
arXiv Detail & Related papers (2023-03-18T14:02:04Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Investigating Pretrained Language Models for Graph-to-Text Generation [55.55151069694146]
Graph-to-text generation aims to generate fluent texts from graph-based data.
We present a study across three graph domains: meaning representations, Wikipedia knowledge graphs (KGs) and scientific KGs.
We show that the PLMs BART and T5 achieve new state-of-the-art results and that task-adaptive pretraining strategies improve their performance even further.
arXiv Detail & Related papers (2020-07-16T16:05:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.