AGB-DE: A Corpus for the Automated Legal Assessment of Clauses in German Consumer Contracts
- URL: http://arxiv.org/abs/2406.06809v1
- Date: Mon, 10 Jun 2024 21:27:13 GMT
- Title: AGB-DE: A Corpus for the Automated Legal Assessment of Clauses in German Consumer Contracts
- Authors: Daniel Braun, Florian Matthes
- Abstract summary: We introduce AGB-DE, a corpus of 3,764 clauses from German consumer contracts that have been annotated and legally assessed by legal experts.
We compare the performance of an SVM baseline with that of three fine-tuned open language models and GPT-3.5.
An analysis of the errors indicates that one of the main challenges could be the correct interpretation of complex clauses.
- Score: 4.427516854041417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Legal tasks and datasets are often used as benchmarks for the capabilities of language models. However, openly available annotated datasets are rare. In this paper, we introduce AGB-DE, a corpus of 3,764 clauses from German consumer contracts that have been annotated and legally assessed by legal experts. Together with the data, we present a first baseline for the task of detecting potentially void clauses, comparing the performance of an SVM baseline with three fine-tuned open language models and the performance of GPT-3.5. Our results show the challenging nature of the task, with no approach exceeding an F1-score of 0.54. While the fine-tuned models often performed better with regard to precision, GPT-3.5 outperformed the other approaches with regard to recall. An analysis of the errors indicates that one of the main challenges could be the correct interpretation of complex clauses, rather than the decision boundaries of what is permissible and what is not.
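For orientation, the following is a minimal sketch of an SVM baseline of the kind the abstract describes: binary detection of potentially void clauses, evaluated with precision, recall, and F1. The TF-IDF character n-gram features, the file name agb_de.csv, and the column names clause and void are illustrative assumptions; the paper does not specify these details.

```python
# Hypothetical SVM baseline sketch for detecting potentially void clauses.
# Feature choice, file name, and column names are assumptions, not details
# taken from the paper.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("agb_de.csv")  # hypothetical export: one clause per row
X_train, X_test, y_train, y_test = train_test_split(
    df["clause"], df["void"], test_size=0.2, random_state=42, stratify=df["void"]
)

# Character n-grams are a common choice for German, where compounding
# makes word-level vocabularies sparse.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), min_df=2),
    LinearSVC(C=1.0),
)
model.fit(X_train, y_train)

# Report the same metrics the abstract discusses (precision, recall, F1).
p, r, f1, _ = precision_recall_fscore_support(
    y_test, model.predict(X_test), average="binary", pos_label=1
)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```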
Related papers
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [55.35278531907263]
We present the first study of Large Language Models' fairness and robustness to dialects in canonical reasoning tasks.
We hire AAVE speakers to rewrite seven popular benchmarks, such as HumanEval and GSM8K.
We find that, compared to Standardized English, almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
- LegalPro-BERT: Classification of Legal Provisions by fine-tuning BERT Large Language Model [0.0]
Contract analysis requires the identification and classification of key provisions and paragraphs within an agreement.
LegalPro-BERT is a BERT transformer architecture model that we fine-tune to efficiently handle the classification task for legal provisions. (A generic fine-tuning sketch of this kind appears after this list.)
arXiv Detail & Related papers (2024-04-15T19:08:48Z)
- Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents [3.8467652838774873]
The complexity of legal texts and the lack of annotated in-domain negation corpora pose challenges for state-of-the-art (SotA) models.
Our experiments demonstrate that models pre-trained without legal data underperform in the task of negation scope resolution.
We release a new set of annotated court decisions in German, French, and Italian and use it to improve negation scope resolution in both zero-shot and multilingual settings.
arXiv Detail & Related papers (2023-09-15T18:38:06Z)
- A negation detection assessment of GPTs: analysis with the xNot360 dataset [9.165119034384027]
Negation is a fundamental aspect of natural language, playing a critical role in communication and comprehension.
We focus on the identification of negation in natural language using a zero-shot prediction approach applied to our custom xNot360 dataset.
Our findings expose a considerable performance disparity among the GPT models, with GPT-4 surpassing its counterparts and GPT-3.5 displaying a marked performance reduction.
arXiv Detail & Related papers (2023-06-29T02:27:48Z)
- Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation [20.675242617417677]
Cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding.
This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation.
arXiv Detail & Related papers (2023-06-22T14:31:18Z)
- SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval [75.05173891207214]
Legal case retrieval plays a core role in intelligent legal systems.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)
- ChatGPT as a Factual Inconsistency Evaluator for Text Summarization [17.166794984161964]
We show that ChatGPT can evaluate factual inconsistency under a zero-shot setting.
It generally outperforms previous evaluation metrics on binary entailment inference, summary ranking, and consistency rating.
However, a closer inspection of ChatGPT's output reveals certain limitations including its preference for more lexically similar candidates, false reasoning, and inadequate understanding of instructions.
arXiv Detail & Related papers (2023-03-27T22:30:39Z)
- Prompted Opinion Summarization with GPT-3.5 [115.95460650578678]
We show that GPT-3.5 models achieve very strong performance in human evaluation.
We argue that standard evaluation metrics do not reflect this, and introduce three new metrics targeting faithfulness, factuality, and genericity.
arXiv Detail & Related papers (2022-11-29T04:06:21Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Elaboration-Generating Commonsense Question Answering at Scale [77.96137534751445]
In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge.
We finetune smaller language models to generate useful intermediate context, referred to here as elaborations.
Our framework alternates between updating two language models -- an elaboration generator and an answer predictor -- allowing each to influence the other.
arXiv Detail & Related papers (2022-09-02T18:32:09Z)
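As referenced in the LegalPro-BERT entry above, and as a companion to the three fine-tuned open language models in the AGB-DE abstract, here is a hedged sketch of fine-tuning a German BERT checkpoint for binary clause classification with Hugging Face Transformers. The checkpoint bert-base-german-cased, the hyperparameters, and the two toy clauses are illustrative assumptions, not values reported by either paper.

```python
# Hedged sketch: fine-tuning a German BERT checkpoint for binary clause
# classification. Checkpoint, hyperparameters, and the in-memory toy
# dataset are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-german-cased"  # assumed; neither paper fixes this choice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-in for an annotated clause corpus: clause text plus a 0/1 label
# (1 = potentially void). A real run would load the full corpus instead.
train = Dataset.from_dict({
    "text": ["Der Anbieter haftet nicht für leichte Fahrlässigkeit.",
             "Die Lieferung erfolgt innerhalb von 14 Tagen."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clause-clf", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-5),
    train_dataset=train,
)
trainer.train()
```

In practice, one would train on the full annotated corpus and report precision, recall, and F1 on a held-out split, as in the SVM baseline sketch above.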
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information listed and is not responsible for any consequences arising from its use.