Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations
- URL: http://arxiv.org/abs/2409.09947v2
- Date: Mon, 23 Sep 2024 19:06:25 GMT
- Title: Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations
- Authors: Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme
- Abstract summary: Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses.
LLMs can often hallucinate in this setting, in ways that are difficult for non-professionals and existing text evaluation metrics to recognize.
We introduce the neutral notion of gaps, as opposed to hallucinations in a strictly erroneous sense, to refer to the difference between human-written and machine-generated legal analysis.
- Score: 38.30926471814935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways that are difficult for non-professionals and existing text evaluation metrics to recognize. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strictly erroneous sense, to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves a 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find that around 80% contain hallucinations of various kinds.
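The pipeline summarized above (a gap taxonomy, a fine-grained detector, and the detector reused as an automated metric) can be pictured with a short evaluation sketch. The Python snippet below is a minimal illustration only: the gap labels, the macro averaging, and the toy data are assumptions for exposition, not the authors' taxonomy, reported scores, or released code.

```python
# Minimal sketch (not the authors' code): scoring a gap-category detector
# against expert annotations, then reusing its predictions as an automated
# hallucination metric. Labels, averaging, and data are illustrative assumptions.
from sklearn.metrics import f1_score, precision_score

# Assumed gap categories; the paper defines its own, richer taxonomy.
GAP_LABELS = ["citation_mismatch", "unsupported_claim", "misread_precedent", "no_gap"]

def is_hallucination(label: str) -> bool:
    """Treat any predicted gap other than 'no_gap' as a potential hallucination."""
    return label != "no_gap"

def score_detector(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """Macro-averaged precision and F1 of the detector against expert labels."""
    return {
        "precision": precision_score(gold, predicted, average="macro", zero_division=0),
        "f1": f1_score(gold, predicted, average="macro", zero_division=0),
    }

def hallucination_rate(predicted: list[str]) -> float:
    """Fraction of generations the detector flags with at least one gap."""
    return sum(is_hallucination(p) for p in predicted) / len(predicted)

if __name__ == "__main__":
    gold = ["no_gap", "unsupported_claim", "citation_mismatch", "no_gap"]
    pred = ["no_gap", "unsupported_claim", "no_gap", "citation_mismatch"]
    print(score_detector(gold, pred))    # detector quality on an annotated test set
    print(hallucination_rate(pred))      # automated metric over LLM-generated analyses
```

The paper's detector works over generated legal analysis and its cited sources; the sketch only shows how precision/F1 against expert annotations and an aggregate hallucination rate fit together as evaluation signals.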
Related papers
- Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools [32.78336381381673]
We report on the first preregistered empirical evaluation of AI-driven legal research tools.
We find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time.
The evaluation provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs.
arXiv Detail & Related papers (2024-05-30T17:56:05Z)
- Explainable machine learning multi-label classification of Spanish legal judgements [6.817247544942709]
We propose a hybrid system that applies Machine Learning for multi-label classification of judgements (sentences) and visual and natural language descriptions for explanation purposes.
Our solution achieves over 85% micro precision on a labelled data set annotated by legal experts.
arXiv Detail & Related papers (2024-05-27T19:16:42Z)
- Automatic explanation of the classification of Spanish legal judgments in jurisdiction-dependent law categories with tree estimators [6.354358255072839]
This work contributes a system that combines Natural Language Processing (NLP) with Machine Learning (ML) to classify legal texts in an explainable manner.
We analyze the features involved in the decision and the threshold bifurcation values of the decision paths of tree structures.
Legal experts have validated our solution, and this knowledge has also been incorporated into the explanation process as "expert-in-the-loop" dictionaries.
arXiv Detail & Related papers (2024-03-30T17:59:43Z)
- HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs [0.0]
Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs).
This paper introduces an automated scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection.
The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain.
arXiv Detail & Related papers (2024-02-25T22:23:37Z)
- Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages.
We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality.
Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
arXiv Detail & Related papers (2024-02-16T08:10:34Z)
- Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generating factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z)
- "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure relevance assessment using (i) hallucination rate, the model's tendency to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, the model's failure to recognize relevant passages in the relevant subset (a minimal computation of both rates is sketched after this list).
arXiv Detail & Related papers (2023-12-18T17:18:04Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
- Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation [5.043563227694139]
Large language models (large LMs) are susceptible to producing text that contains hallucinated content.
We present a comprehensive investigation into self-contradiction for various instruction-tuned LMs.
We propose a novel prompting-based framework designed to effectively detect and mitigate self-contradictions.
arXiv Detail & Related papers (2023-05-25T08:43:46Z)
- Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can lead to malicious consequences such as plagiarism, fake news generation, and spamming.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
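As referenced in the NoMIRACL entry above, the two relevance-assessment measures reduce to simple counting over the two evaluation subsets. The Python sketch below illustrates that bookkeeping under stated assumptions (the per-query fields and the abstention convention are hypothetical); it is not the dataset's released evaluation script.

```python
# Hypothetical sketch of the two NoMIRACL-style measures described above.
# Each query is assumed to record (a) which subset it belongs to and (b) whether
# the model abstained (said the passages contain no answer) or produced an answer.
from dataclasses import dataclass

@dataclass
class QueryResult:
    subset: str             # "non_relevant" (no answer in passages) or "relevant"
    model_abstained: bool   # True if the model reported that no answer is present

def hallucination_rate(results: list[QueryResult]) -> float:
    """On the non-relevant subset, a non-abstaining model is hallucinating:
    it claims an answer although none is present in the passages."""
    subset = [r for r in results if r.subset == "non_relevant"]
    hallucinated = sum(not r.model_abstained for r in subset)
    return hallucinated / len(subset) if subset else 0.0

def error_rate(results: list[QueryResult]) -> float:
    """On the relevant subset, an abstaining model is in error:
    it fails to recognize that the passages do contain the answer."""
    subset = [r for r in results if r.subset == "relevant"]
    errors = sum(r.model_abstained for r in subset)
    return errors / len(subset) if subset else 0.0

if __name__ == "__main__":
    results = [
        QueryResult("non_relevant", model_abstained=False),  # hallucination
        QueryResult("non_relevant", model_abstained=True),   # correct abstention
        QueryResult("relevant", model_abstained=True),       # error
        QueryResult("relevant", model_abstained=False),      # correct answer
    ]
    print(hallucination_rate(results))  # 0.5
    print(error_rate(results))          # 0.5
```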