Black-Box Analysis: GPTs Across Time in Legal Textual Entailment Task
- URL: http://arxiv.org/abs/2309.05501v1
- Date: Mon, 11 Sep 2023 14:43:54 GMT
- Title: Black-Box Analysis: GPTs Across Time in Legal Textual Entailment Task
- Authors: Ha-Thanh Nguyen, Randy Goebel, Francesca Toni, Kostas Stathis, Ken Satoh
- Abstract summary: We present an analysis of GPT-3.5 (ChatGPT) and GPT-4 performance on the COLIEE Task 4 dataset.
Our preliminary experimental results unveil intriguing insights into the models' strengths and weaknesses in handling legal textual entailment tasks.
- Score: 17.25356594832692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The evolution of Generative Pre-trained Transformer (GPT) models has led to
significant advancements in various natural language processing applications,
particularly in legal textual entailment. We present an analysis of GPT-3.5
(ChatGPT) and GPT-4 performance on the COLIEE Task 4 dataset, a prominent
benchmark in this domain. The study encompasses data from Heisei 18 (2006) to
Reiwa 3 (2021), exploring the models' abilities to discern entailment
relationships within Japanese statute law across different periods. Our
preliminary experimental results unveil intriguing insights into the models'
strengths and weaknesses in handling legal textual entailment tasks, as well as
the patterns observed in model performance. In the context of proprietary
models with undisclosed architectures and weights, black-box analysis becomes
crucial for evaluating their capabilities. We discuss the influence of training
data distribution and its implications for the models' generalizability. This
analysis serves as a foundation for future research, aiming to optimize
GPT-based models and enable their successful adoption in legal information
extraction and entailment applications.
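
To make the black-box setting concrete, the sketch below shows one way such an evaluation could be scripted against a chat-completion API: each COLIEE-style query pairs relevant statute articles (the premise) with a legal statement (the hypothesis), the model is asked for a Y/N entailment judgment, and accuracy is tallied per year. This is a minimal illustration, not the authors' pipeline; the file name, record fields, prompt wording, and model identifier are assumptions introduced here.

# Minimal sketch of a black-box legal entailment evaluation loop (illustrative only).
# Assumptions not taken from the paper: the JSON file name, its record fields
# ("year", "premise", "hypothesis", "label"), the prompt wording, and the model name.
import json
from collections import defaultdict

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Premise (statute articles):\n{premise}\n\n"
    "Hypothesis (legal statement):\n{hypothesis}\n\n"
    "Does the premise entail the hypothesis? Answer with Y or N only."
)

def ask_model(premise: str, hypothesis: str, model: str = "gpt-4") -> str:
    """Query the model once and normalize its answer to 'Y' or 'N'."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": PROMPT.format(premise=premise, hypothesis=hypothesis)}],
    )
    text = response.choices[0].message.content.strip().upper()
    return "Y" if text.startswith("Y") else "N"

def evaluate(path: str = "coliee_task4.json") -> dict:
    """Compute per-year accuracy over a list of {year, premise, hypothesis, label} records."""
    with open(path, encoding="utf-8") as f:
        examples = json.load(f)
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        prediction = ask_model(ex["premise"], ex["hypothesis"])
        total[ex["year"]] += 1
        correct[ex["year"]] += int(prediction == ex["label"])
    return {year: correct[year] / total[year] for year in sorted(total)}

if __name__ == "__main__":
    for year, accuracy in evaluate().items():
        print(f"{year}: {accuracy:.2%}")

Setting the temperature to zero and parsing only the leading Y/N keeps the scoring deterministic and comparable across model versions, which is the kind of control a year-by-year black-box comparison relies on.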
Related papers
- Unraveling the Capabilities of Language Models in News Summarization [0.0]
This work provides a comprehensive benchmarking of 20 recent language models, focusing on smaller ones for the news summarization task.
The study focuses on zero-shot and few-shot learning settings and applies a robust evaluation methodology.
We highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities.
arXiv Detail & Related papers (2025-01-30T04:20:16Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- On Training Data Influence of GPT Models [37.53037752668756]
GPTfluence is a novel approach to assess the impact of training examples on the training dynamics of GPT models.
Our approach traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, of targeted test points.
arXiv Detail & Related papers (2024-04-11T15:27:56Z)
- Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection [10.301985230669684]
This paper presents a comprehensive analysis of GPT-4, GPT-3.5 Turbo, and FLAN-T5 models in detecting framing in news headlines.
We evaluated these models in various scenarios: zero-shot, few-shot with in-domain examples, cross-domain examples, and settings where models explain their predictions.
arXiv Detail & Related papers (2024-02-18T15:27:48Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning [5.177947445379688]
This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis.
Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification.
The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations.
arXiv Detail & Related papers (2023-07-16T05:33:35Z)
- Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z)
- A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models [71.42197262495056]
GPT series models have gained considerable attention due to their exceptional natural language processing capabilities.
We select six representative models, comprising two GPT-3 series models and four GPT-3.5 series models.
We evaluate their performance on nine natural language understanding (NLU) tasks using 21 datasets.
Our experiments reveal that the overall ability of GPT series models on NLU tasks does not increase gradually as the models evolve.
arXiv Detail & Related papers (2023-03-18T14:02:04Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Investigating Pretrained Language Models for Graph-to-Text Generation [55.55151069694146]
Graph-to-text generation aims to generate fluent texts from graph-based data.
We present a study across three graph domains: meaning representations, Wikipedia knowledge graphs (KGs) and scientific KGs.
We show that the PLMs BART and T5 achieve new state-of-the-art results and that task-adaptive pretraining strategies improve their performance even further.
arXiv Detail & Related papers (2020-07-16T16:05:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.