Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering
- URL: http://arxiv.org/abs/2501.06521v1
- Date: Sat, 11 Jan 2025 12:08:15 GMT
- Title: Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering
- Authors: Yinghao Hu, Leilei Gan, Wenyi Xiao, Kun Kuang, Fei Wu
- Abstract summary: Hallucination, or the generation of incorrect or fabricated information, remains a critical challenge in large language models.
We propose a hallucination mitigation method that integrates behavior cloning and a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO).
Our results demonstrate remarkable improvements in various metrics, including the newly proposed Non-Hallucinated Statute Rate.
- Score: 30.795942355722797
- Abstract: Hallucination, or the generation of incorrect or fabricated information, remains a critical challenge in large language models (LLMs), particularly in high-stakes domains such as legal question answering (QA). To mitigate the hallucination rate in legal QA, we first introduce a benchmark called LegalHalBench and three automatic metrics to evaluate the common hallucinations when LLMs answer legal questions. We then propose a hallucination mitigation method that integrates behavior cloning and a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO). We conduct extensive real-data experiments to validate the effectiveness of our approach. Our results demonstrate remarkable improvements in various metrics, including the newly proposed Non-Hallucinated Statute Rate, Statute Relevance Rate, and Legal Claim Truthfulness, as well as traditional metrics such as METEOR, BERTScore, ROUGE-L, and win rates.
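HIPO builds on Direct Preference Optimization over pairs of preferred (non-hallucinated) and rejected (hallucinated) answers. The hard-sample-aware iterative variant is not detailed in the abstract, so the sketch below only shows the standard DPO objective it extends, with placeholder tensors standing in for per-sequence log-probabilities.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over a batch of preference pairs.

    Each argument is a tensor of summed per-sequence log-probabilities
    log pi(y|x) for the chosen (non-hallucinated) and rejected
    (hallucinated) answers, under the trained policy and a frozen reference.
    """
    # Implicit reward margins: how much more the policy prefers the chosen
    # answer over the rejected one, relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards
    # -log sigmoid(margin): larger margins are penalized less.
    loss = -F.logsigmoid(margin).mean()
    return loss, margin.detach()

# Toy usage with random log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    lp = lambda: torch.randn(4)
    loss, margin = dpo_loss(lp(), lp(), lp(), lp())
    print(loss.item(), margin)
```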
Related papers
- HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses [0.12499537119440242]
This paper proposes an explanation-enhanced hallucination-detection model, coined HuDEx.
The proposed model integrates detection with explanations, enabling both users and the LLM itself to understand and reduce errors.
arXiv Detail & Related papers (2025-02-12T04:17:02Z) - Methods for Legal Citation Prediction in the Age of LLMs: An Australian Law Case Study [9.30538764385435]
We focus on the problem of legal citation prediction within the Australian law context, where correctly identifying and citing relevant legislation or precedents is critical.
Our findings indicate that law-specialised pre-training alone is insufficient for achieving satisfactory citation accuracy.
In contrast, instruction tuning on our task-specific dataset dramatically boosts performance, achieving the best results across all settings.
arXiv Detail & Related papers (2024-12-09T07:46:14Z) - The Factuality of Large Language Models in the Legal Domain [8.111302195052641]
This paper investigates the factuality of large language models (LLMs) as knowledge bases in the legal domain.
We design a dataset of diverse factual questions about case law and legislation.
We then use the dataset to evaluate several LLMs under different evaluation methods, including exact, alias, and fuzzy matching.
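The abstract does not specify how the matching is implemented; the following is only an illustrative sketch of exact, alias, and fuzzy matching using Python's standard difflib, where the alias table and the 0.8 similarity threshold are assumptions.

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def alias_match(pred: str, gold: str, aliases: dict[str, set[str]]) -> bool:
    # aliases maps a canonical answer to strings that count as equivalent,
    # e.g. {"GDPR": {"General Data Protection Regulation"}} (hypothetical table).
    candidates = {normalize(gold)} | {normalize(a) for a in aliases.get(gold, set())}
    return normalize(pred) in candidates

def fuzzy_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    # Character-level similarity ratio; the 0.8 cut-off is an assumption.
    return SequenceMatcher(None, normalize(pred), normalize(gold)).ratio() >= threshold

print(fuzzy_match("General Data Protection Regulation (GDPR)",
                  "General Data Protection Regulation"))  # True
```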
arXiv Detail & Related papers (2024-09-18T08:30:20Z) - HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs [0.0]
Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs).
This paper introduces an automated scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection.
The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain.
arXiv Detail & Related papers (2024-02-25T22:23:37Z) - Fine-Grained Self-Endorsement Improves Factuality and Reasoning [72.83651220132495]
This work studies improving large language model (LLM) generations at inference time by mitigating fact-conflicting hallucinations.
We propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses.
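The paper's fine-grained fact-level comparison is not reproduced here; the sketch below only illustrates the general idea of keeping statements that other sampled responses endorse, with a crude sentence splitter and string-similarity check standing in for real fact extraction and matching.

```python
from difflib import SequenceMatcher

def split_into_facts(response: str) -> list[str]:
    # Placeholder: treat each sentence as one atomic fact claim.
    return [s.strip() for s in response.split(".") if s.strip()]

def endorsed(fact: str, other_response: str, threshold: float = 0.7) -> bool:
    # A fact is "endorsed" by another sample if some sentence in it is
    # sufficiently similar (assumption: plain string similarity).
    return any(SequenceMatcher(None, fact.lower(), s.lower()).ratio() >= threshold
               for s in split_into_facts(other_response))

def self_endorse(samples: list[str]) -> list[str]:
    """Keep facts from the first sample that at least half of the other samples endorse."""
    others = samples[1:]
    kept = []
    for fact in split_into_facts(samples[0]):
        votes = sum(endorsed(fact, other) for other in others)
        if votes >= len(others) / 2:
            kept.append(fact)
    return kept

samples = [
    "Article 17 GDPR grants a right to erasure. It entered into force in 2018.",
    "The right to erasure is set out in Article 17 GDPR. It entered into force in 2018.",
    "Article 17 GDPR grants the right to erasure. It applies from 2018.",
]
print(self_endorse(samples))
```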
arXiv Detail & Related papers (2024-02-23T22:24:40Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - Alleviating Hallucinations of Large Language Models through Induced Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information.
We propose a simple Induce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations.
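ICD contrasts the original model's next-token distribution against that of a model whose hallucinations were deliberately induced. The exact formulation is not given in this summary, so the sketch below shows only a generic contrastive-decoding adjustment of the logits, with the contrast weight alpha as an assumption.

```python
import torch

def contrastive_next_token_logits(base_logits: torch.Tensor,
                                  induced_logits: torch.Tensor,
                                  alpha: float = 1.0) -> torch.Tensor:
    """Generic contrastive decoding step.

    base_logits:    next-token logits from the original model.
    induced_logits: next-token logits from a hallucination-induced model;
                    both have shape [vocab_size].
    Tokens favoured by the hallucination-prone model are down-weighted
    relative to the base model.
    """
    log_p_base = torch.log_softmax(base_logits, dim=-1)
    log_p_induced = torch.log_softmax(induced_logits, dim=-1)
    return (1 + alpha) * log_p_base - alpha * log_p_induced

# Toy usage over a 10-token vocabulary.
base = torch.randn(10)
induced = torch.randn(10)
print(contrastive_next_token_logits(base, induced).argmax().item())
```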
arXiv Detail & Related papers (2023-12-25T12:32:49Z) - Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
However, LLMs are prone to hallucinating untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
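The paper's keyword-focused uncertainty estimation is not reproduced here; as a rough illustration, reference-free uncertainty proxies such as mean token entropy and mean negative log-likelihood can be computed directly from the model's own logits.

```python
import torch

def uncertainty_scores(logits: torch.Tensor, generated_ids: torch.Tensor):
    """Simple reference-free uncertainty proxies for a generated answer.

    logits:        [seq_len, vocab_size] next-token logits produced while
                   generating the answer.
    generated_ids: [seq_len] ids of the tokens actually generated.
    Returns mean token entropy and mean negative log-likelihood; higher
    values suggest lower model confidence, a common (though coarse)
    hallucination signal.
    """
    log_probs = torch.log_softmax(logits, dim=-1)                       # [T, V]
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)                          # [T]
    nll = -log_probs.gather(1, generated_ids.unsqueeze(1)).squeeze(1)   # [T]
    return entropy.mean().item(), nll.mean().item()

# Toy usage: 5 generated tokens over a 100-token vocabulary.
logits = torch.randn(5, 100)
ids = torch.randint(0, 100, (5,))
print(uncertainty_scores(logits, ids))
```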
arXiv Detail & Related papers (2023-11-22T08:39:17Z) - Chainpoll: A high efficacy method for LLM hallucination detection [0.0]
We introduce ChainPoll, an innovative hallucination detection method that outperforms existing alternatives.
We also unveil RealHall, a refined collection of benchmark datasets to assess hallucination detection metrics from recent studies.
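ChainPoll scores an answer by repeatedly asking a judge LLM, with a chain-of-thought prompt, whether the answer contains hallucinations and aggregating the verdicts. In the minimal sketch below, the judge_llm callable and the prompt wording are placeholders rather than the paper's exact setup.

```python
from typing import Callable

def chainpoll_score(question: str, answer: str,
                    judge_llm: Callable[[str], str], n_polls: int = 5) -> float:
    """Fraction of judge verdicts that flag the answer as hallucinated.

    judge_llm is a hypothetical callable that takes a prompt and returns the
    judge model's text response; the prompt below is illustrative only.
    """
    prompt = (
        "Does the following answer contain hallucinated or fabricated claims? "
        "Think step by step, then finish with YES or NO.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    votes = 0
    for _ in range(n_polls):
        verdict = judge_llm(prompt)
        if verdict.strip().upper().endswith("YES"):
            votes += 1
    return votes / n_polls

# Toy usage with a stubbed judge that always answers NO.
print(chainpoll_score("What does Article 17 GDPR cover?",
                      "It covers the right to erasure.",
                      judge_llm=lambda p: "Reasoning... NO"))
```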
arXiv Detail & Related papers (2023-10-22T14:45:14Z) - AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets from existing fact-checking datasets.
We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
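A common way to realise such a zero-resource, black-box check is to resample the model on the same question and measure how often the new answers contradict the original one. The sketch below uses a deliberately naive stand-in for a real contradiction (NLI) checker, so the heuristic and all names are illustrative.

```python
from typing import Callable

def naive_contradicts(a: str, b: str) -> bool:
    # Hypothetical stand-in for a real NLI model: flags a contradiction only
    # when exactly one of the two answers contains a negation. In practice,
    # an off-the-shelf NLI classifier would be plugged in here.
    return (" not " in a.lower()) != (" not " in b.lower())

def self_contradiction_rate(answer: str, resampled: list[str],
                            contradicts: Callable[[str, str], bool] = naive_contradicts) -> float:
    """Zero-resource, black-box style check: the more often resampled answers
    contradict the original answer, the more likely it is hallucinated."""
    if not resampled:
        return 0.0
    return sum(contradicts(answer, other) for other in resampled) / len(resampled)

print(self_contradiction_rate(
    "The statute does apply to minors.",
    ["The statute does not apply to minors.", "It applies to minors as well."],
))
```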
arXiv Detail & Related papers (2023-09-30T05:20:02Z) - Contrastive Learning Reduces Hallucination in Conversations [76.55116206021346]
We propose a contrastive learning scheme named MixCL.
Its novel mixed contrastive objective explicitly optimizes the implicit knowledge elicitation process of LMs.
We show that MixCL achieves comparable performance to state-of-the-art KB-based approaches.
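MixCL's mixed contrastive objective is not spelled out in this summary; the sketch below shows only a generic InfoNCE-style loss that contrasts a knowledge-grounded continuation against hallucinated ones, with all shapes and names chosen for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  negatives: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Generic InfoNCE contrastive loss.

    anchor:    [d]    representation of the dialogue context.
    positive:  [d]    representation of a knowledge-grounded continuation.
    negatives: [k, d] representations of hallucinated continuations.
    The loss pulls the anchor towards the positive and away from the negatives.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1, keepdim=True)   # [1]
    neg_sim = negatives @ anchor                           # [k]
    logits = torch.cat([pos_sim, neg_sim]) / temperature   # [1 + k]
    # The positive sits at index 0; cross-entropy makes it the most similar.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss = info_nce_loss(torch.randn(16), torch.randn(16), torch.randn(4, 16))
print(loss.item())
```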
arXiv Detail & Related papers (2022-12-20T16:26:18Z)