Team UTSA-NLP at SemEval 2024 Task 5: Prompt Ensembling for Argument Reasoning in Civil Procedures with GPT4
- URL: http://arxiv.org/abs/2404.01961v1
- Date: Tue, 2 Apr 2024 13:55:05 GMT
- Title: Team UTSA-NLP at SemEval 2024 Task 5: Prompt Ensembling for Argument Reasoning in Civil Procedures with GPT4
- Authors: Dan Schumacher, Anthony Rios
- Abstract summary: We present our system for the SemEval Task 5, The Legal Argument Reasoning Task in Civil Procedure Challenge.
Our system explores a prompt-based solution using GPT4 to reason over legal arguments.
- Overall, our system results in a Macro F1 of .8095 on the validation dataset and .7315 (5th out of 21 teams) on the final test set.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present our system for the SemEval Task 5, The Legal Argument Reasoning Task in Civil Procedure Challenge. Legal argument reasoning is an essential skill that all law students must master. Moreover, it is important to develop natural language processing solutions that can reason about a question given terse domain-specific contextual information. Our system explores a prompt-based solution using GPT4 to reason over legal arguments. We also evaluate an ensemble of prompting strategies, including chain-of-thought reasoning and in-context learning. Overall, our system results in a Macro F1 of .8095 on the validation dataset and .7315 (5th out of 21 teams) on the final test set. Code for this project is available at https://github.com/danschumac1/CivilPromptReasoningGPT4.
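The abstract does not spell out the ensembling mechanics, so the following is a minimal sketch of what a GPT-4 prompt ensemble for this task could look like, assuming the OpenAI chat completions API; the prompt templates, the few-shot demonstration, and the majority-vote rule are illustrative guesses rather than the authors' exact setup (see the linked repository for the real implementation).

```python
# Illustrative sketch of a GPT-4 prompt ensemble for binary answer validation.
# Prompt wording, the in-context demonstration, and the majority-vote rule are
# assumptions, not the authors' exact configuration.
from collections import Counter

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT = (
    "You are a law professor. Given the case introduction, question, and a "
    "candidate answer, reply with exactly 'Correct' or 'Incorrect'.\n\n{example}"
)
CHAIN_OF_THOUGHT = (
    "You are a law professor. Reason step by step about whether the candidate "
    "answer follows from the case, then finish with 'Correct' or 'Incorrect'."
    "\n\n{example}"
)
IN_CONTEXT = (
    "Decide whether the candidate answer is correct.\n\n"
    "Example:\nCase: <demonstration case>\nAnswer: <demonstration answer>\n"
    "Label: Correct\n\nNow label the following.\n\n{example}"
)


def ask_gpt4(prompt: str) -> str:
    """Query GPT-4 once and map the reply to a binary label."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content.lower()
    return "Incorrect" if "incorrect" in text else "Correct"


def ensemble_predict(example: str) -> str:
    """Majority vote over the three prompting strategies."""
    votes = [
        ask_gpt4(template.format(example=example))
        for template in (ZERO_SHOT, CHAIN_OF_THOUGHT, IN_CONTEXT)
    ]
    return Counter(votes).most_common(1)[0][0]
```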
Related papers
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
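A minimal sketch of how a scanned handwritten response could be passed to GPT-4o for grading through the OpenAI chat API; the instruction text and the 0-10 scale below are assumptions for illustration, not that paper's evaluation protocol.

```python
# Sketch: send a scanned handwritten answer to GPT-4o and ask for a grade.
# The rubric wording and the 0-10 scale are illustrative assumptions.
import base64

from openai import OpenAI

client = OpenAI()


def grade_handwritten_answer(image_path: str, question: str) -> str:
    """Return GPT-4o's free-text grade for one scanned answer."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Grade this handwritten solution to: '{question}'. "
                          "Give a score from 0 to 10 and a one-sentence "
                          "justification.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encoded}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```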
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
- Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT4's performance rises to 11.7%.
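A rough sketch of the "generate code instead of answering in text" idea, assuming the OpenAI API and a trusted sandbox; the prompt and the naive code extraction below are illustrative, not the SearchBench evaluation harness.

```python
# Sketch: ask the model for a Python solver rather than a textual answer,
# then execute it. Running model-generated code requires sandboxing; this
# is for illustration only.
from openai import OpenAI

client = OpenAI()


def solve_with_generated_code(problem_text: str) -> str:
    """Prompt GPT-4 for a solve() function and run it on the problem."""
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": ("Write a self-contained Python function solve() that "
                        "returns the answer to this search problem. Reply with "
                        f"code only.\n\n{problem_text}"),
        }],
    ).choices[0].message.content
    code = reply.strip().removeprefix("```python").removesuffix("```")
    namespace: dict = {}
    exec(code, namespace)  # sandbox this in any real evaluation
    return str(namespace["solve"]())
```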
arXiv Detail & Related papers (2024-06-18T00:44:58Z)
- NLP at UC Santa Cruz at SemEval-2024 Task 5: Legal Answer Validation using Few-Shot Multi-Choice QA [0.0]
We present two approaches to solving the task of legal answer validation.
First, we fine-tuned pre-trained BERT-based models and found that models trained on domain knowledge perform better.
Second, we performed few-shot prompting on GPT models and found that reformulating the answer validation task to be a multiple-choice QA task remarkably improves the performance of the model.
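A small sketch of the reformulation described above: instead of validating each candidate answer with a separate yes/no query, all candidates are packed into one multiple-choice prompt. The wording and lettering scheme are assumptions.

```python
# Sketch: recast per-candidate answer validation as one multiple-choice
# question. The prompt wording is an illustrative assumption.
from string import ascii_uppercase


def build_multiple_choice_prompt(introduction: str, question: str,
                                 candidates: list[str]) -> str:
    """Pack all candidate answers into a single multiple-choice prompt."""
    options = "\n".join(
        f"{letter}. {answer}"
        for letter, answer in zip(ascii_uppercase, candidates)
    )
    return (
        f"Case introduction:\n{introduction}\n\n"
        f"Question: {question}\n\n"
        f"Candidate answers:\n{options}\n\n"
        "Which candidate answer is correct? Reply with a single letter."
    )
```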
arXiv Detail & Related papers (2024-04-04T01:50:20Z)
- GPTs and Language Barrier: A Cross-Lingual Legal QA Examination [5.253214457141011]
We explore the application of Generative Pre-trained Transformers (GPTs) in cross-lingual legal Question-Answering (QA) systems using the COLIEE Task 4 dataset.
In the COLIEE Task 4, given a statement and a set of related legal articles that serve as context, the objective is to determine whether the statement is legally valid.
By benchmarking four different combinations of English and Japanese prompts and data, we provide valuable insights into GPTs' performance in multilingual legal QA scenarios.
arXiv Detail & Related papers (2024-03-26T20:47:32Z)
- Towards Unsupervised Question Answering System with Multi-level Summarization for Legal Text [0.0]
This paper summarizes Team SCaLAR's work on SemEval-2024 Task 5: Legal Argument Reasoning in Civil Procedure.
We propose a simple yet novel similarity and distance-based unsupervised approach to generate labels.
Our unsupervised system witnessed a 20-point increase in macro F1-score on the development set and a 10-point increase on the test set.
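The summary above leaves the labelling details open; a bare-bones analogue of a similarity-based unsupervised labeler is sketched below, with TF-IDF cosine similarity standing in for whatever summarization and representation that system actually uses.

```python
# Bare-bones analogue of similarity-based unsupervised labelling: score each
# candidate answer against the (summarised) case text and treat the closest
# one as correct. TF-IDF is a stand-in representation for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pick_most_similar(case_summary: str, candidates: list[str]) -> int:
    """Return the index of the candidate most similar to the case summary."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([case_summary] + candidates)
    similarities = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    return int(similarities.argmax())
```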
arXiv Detail & Related papers (2024-03-19T19:15:13Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
- Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks [8.223311621898983]
GPT-4 with conversational prompts showed drastic improvement compared to GPT-4 with automatic prompting strategies.
However, fully automated prompt engineering with no human in the loop requires further study and improvement.
arXiv Detail & Related papers (2023-10-11T00:21:00Z)
- Pushing the Limits of ChatGPT on NLP Tasks [79.17291002710517]
Despite the success of ChatGPT, its performances on most NLP tasks are still well below the supervised baselines.
In this work, we looked into the causes and discovered that its subpar performance stems from several factors.
We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks.
arXiv Detail & Related papers (2023-06-16T09:40:05Z)
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 existing datasets with task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
- Factoring Statutory Reasoning as Language Understanding Challenges [48.13180364616141]
We decompose statutory reasoning into four types of language-understanding challenge problems.
We introduce concepts and structure found in Prolog programs.
Models for statutory reasoning are shown to benefit from the additional structure.
arXiv Detail & Related papers (2021-05-17T14:33:02Z)
- IIE-NLP-NUT at SemEval-2020 Task 4: Guiding PLM with Prompt Template Reconstruction Strategy for ComVE [13.334749848189826]
We formalize the subtasks into the multiple-choice question answering format and construct the input with the prompt templates.
Experimental results show that our approaches achieve significant performance compared with the baseline systems.
Our approaches secure the third rank on both official test sets of the first two subtasks, with accuracies of 96.4 and 94.3, respectively.
arXiv Detail & Related papers (2020-07-02T06:59:53Z)
- CS-NLP team at SemEval-2020 Task 4: Evaluation of State-of-the-art NLP Deep Learning Architectures on Commonsense Reasoning Task [3.058685580689605]
We describe our attempt at SemEval-2020 Task 4 competition: Commonsense Validation and Explanation (ComVE) challenge.
Our system uses prepared labeled textual datasets that were manually curated for three different natural language inference subtasks.
For the second subtask, which is to select the reason why a statement does not make sense, we stand within the first six teams (93.7%) among 27 participants with very competitive results.
arXiv Detail & Related papers (2020-05-17T13:20:10Z)