Related papers: Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned

URL: http://arxiv.org/abs/2407.15441v1
Date: Mon, 22 Jul 2024 07:48:30 GMT
Title: Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned
Authors: Song Wang, Xun Wang, Jie Mei, Yujia Xie, Sean Muarray, Zhang Li, Lingfeng Wu, Si-Qing Chen, Wayne Xiong
Abstract summary: We introduce a reliable and high-speed production system aimed at detecting and rectifying the hallucination issue within large language models (LLMs). Our system encompasses named entity recognition (NER), natural language inference (NLI), and span-based detection (SBD). We detail the core elements of our framework and underscore the paramount challenges tied to response time, availability, and performance metrics.
Score: 36.216938133315786
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hallucination, a phenomenon where large language models (LLMs) produce output that is factually incorrect or unrelated to the input, is a major challenge for LLM applications that require accuracy and dependability. In this paper, we introduce a reliable and high-speed production system aimed at detecting and rectifying the hallucination issue within LLMs. Our system encompasses named entity recognition (NER), natural language inference (NLI), span-based detection (SBD), and an intricate decision tree-based process to reliably detect a wide range of hallucinations in LLM responses. Furthermore, our team has crafted a rewriting mechanism that maintains an optimal mix of precision, response time, and cost-effectiveness. We detail the core elements of our framework and underscore the paramount challenges tied to response time, availability, and performance metrics, which are crucial for real-world deployment of these technologies. Our extensive evaluation, utilizing offline data and live production traffic, confirms the efficacy of our proposed framework and service.

Related papers

Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [48.15636223774418]
Large language models (LLMs) frequently hallucinate due to misaligned self-awareness. Existing approaches mitigate hallucinations via uncertainty estimation or query rejection. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
REFIND at SemEval-2025 Task 3: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models [15.380441563675243]
REFIND (Retrieval-augmented Factuality hallucINation Detection) is a novel framework that detects hallucinated spans within large language model (LLM) outputs. We propose the Context Sensitivity Ratio (CSR), a novel metric that quantifies the sensitivity of LLM outputs to retrieved evidence. REFIND demonstrated robustness across nine languages, including low-resource settings, and significantly outperformed baseline models.
arXiv Detail & Related papers (2025-02-19T10:59:05Z)
SenseRAG: Constructing Environmental Knowledge Bases with Proactive Querying for LLM-Based Autonomous Driving [10.041702058108482]
This study addresses the critical need for enhanced situational awareness in autonomous driving (AD) by leveraging the contextual reasoning capabilities of large language models (LLMs). Unlike traditional perception systems that rely on rigid, label-based annotations, it integrates real-time, multimodal sensor data into a unified, LLMs-readable knowledge base. Experimental results using real-world Vehicle-to-everything (V2X) datasets demonstrate significant improvements in perception and prediction performance.
arXiv Detail & Related papers (2025-01-07T05:15:46Z)
Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG). We compute a factuality score that can be thresholded to yield a binary decision. Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
arXiv Detail & Related papers (2024-11-01T20:44:59Z)
Beyond Fine-Tuning: Effective Strategies for Mitigating Hallucinations in Large Language Models for Data Analytics [0.0]
Large Language Models (LLMs) have become increasingly important in natural language processing, enabling advanced data analytics through natural language queries. These models often generate "hallucinations" (inaccurate or fabricated information) that can undermine their reliability in critical data-driven decision-making. This research focuses on mitigating hallucinations in LLMs, specifically within the context of data analytics.
arXiv Detail & Related papers (2024-10-26T00:45:42Z)
Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models [0.0]
Large Language Models (LLMs) are powerful computational models trained on extensive corpora of human-readable text, enabling them to perform general-purpose language understanding and generation. Despite these successes, LLMs often produce inaccuracies, commonly referred to as hallucinations. This paper provides an empirical evaluation of different prompting strategies and frameworks aimed at reducing hallucinations in LLMs.
arXiv Detail & Related papers (2024-10-25T08:34:53Z)
Boosting Healthcare LLMs Through Retrieved Context [0.6144680854063939]
This study explores the boundaries of context retrieval methods within the healthcare domain. Our findings reveal how open LLMs can achieve performance comparable to the biggest private solutions on established healthcare benchmarks. In particular, we propose OpenMedPrompt to improve the generation of more reliable open-ended answers.
arXiv Detail & Related papers (2024-09-23T15:33:38Z)
HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making [3.844437360527058]
In critical domains such as health and medicine, hallucinations can pose serious risks. This paper introduces HALO, a novel framework designed to enhance the accuracy and reliability of medical question-answering systems.
arXiv Detail & Related papers (2024-09-16T05:50:39Z)
SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection [10.54378596443678]
Large language models (LLMs) are highly capable but face latency challenges in real-time applications. This study optimizes real-time, interpretable hallucination detection by introducing effective prompting techniques.
arXiv Detail & Related papers (2024-08-22T22:13:13Z)
Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-based Retrofitting [51.7049140329611]
This paper proposes Knowledge Graph-based Retrofitting (KGR) to mitigate factual hallucination during the reasoning process. Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks.
arXiv Detail & Related papers (2023-11-22T11:08:38Z)
Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations. We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning. They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health. Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks. This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.