Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
- URL: http://arxiv.org/abs/2409.11242v2
- Date: Fri, 11 Oct 2024 08:53:31 GMT
- Title: Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
- Authors: Maojia Song, Shang Hong Sim, Rishabh Bhardwaj, Hai Leong Chieu, Navonil Majumder, Soujanya Poria
- Abstract summary: We introduce Trust-Score, a holistic metric that evaluates the trustworthiness of LLMs within the RAG framework.
Our results show that various prompting methods, such as in-context learning, fail to effectively adapt LLMs to the RAG task.
We propose Trust-Align, a method to align LLMs for improved Trust-Score performance.
- Score: 27.26121507279163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs are an integral component of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the overall quality of end-to-end RAG systems, there is a gap in understanding the appropriateness of LLMs for the RAG task. To address this, we introduce Trust-Score, a holistic metric that evaluates the trustworthiness of LLMs within the RAG framework. Our results show that various prompting methods, such as in-context learning, fail to effectively adapt LLMs to the RAG task as measured by Trust-Score. Consequently, we propose Trust-Align, a method to align LLMs for improved Trust-Score performance. The LLaMA-3 family, aligned using our method, significantly outperforms open-source LLMs of similar sizes on ASQA (up 14.0), QAMPARI (up 28.9), and ELI5 (up 13.7). We also demonstrate the effectiveness of Trust-Align across different open-weight models, including the LLaMA series (1b to 8b), Qwen-2.5 series (0.5b to 7b), and Phi3.5 (3.8b). We release our code at https://anonymous.4open.science/r/trust-align
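To make the shape of such a metric concrete, below is a minimal Python sketch of a Trust-Score-style aggregate, assuming it combines components for answer correctness, grounded refusal, and citation recall/precision. The component names and the plain averaging are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a Trust-Score-style aggregate metric for RAG responses.
# Component definitions and the simple mean are illustrative, not the paper's formulas.
from dataclasses import dataclass

@dataclass
class TrustComponents:
    answer_correctness: float    # overlap of generated answers with gold answers, in [0, 1]
    grounded_refusal_f1: float   # F1 of refusing exactly when the documents lack an answer
    citation_recall: float       # fraction of statements supported by their cited documents
    citation_precision: float    # fraction of citations that are actually relevant

def trust_score(c: TrustComponents) -> float:
    """Aggregate the components into a single score (plain averaging is an assumption)."""
    parts = [c.answer_correctness, c.grounded_refusal_f1,
             c.citation_recall, c.citation_precision]
    return sum(parts) / len(parts)

# Example usage with made-up numbers:
print(trust_score(TrustComponents(0.62, 0.71, 0.58, 0.66)))  # -> 0.6425
```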
Related papers
- Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation [45.059818539256426]
We propose the Chain-of-Embedding (CoE) in the latent space to enable LLMs to perform output-free self-evaluation.
CoE consists of all progressive hidden states produced during the inference time, which can be treated as the latent thinking path of LLMs.
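As a rough illustration of the idea (not the paper's exact formulation), the sketch below summarizes a sequence of per-layer hidden states by how far and how sharply the state moves between consecutive layers; `coe_signal` and its two summary statistics are hypothetical names.

```python
# Illustrative sketch: summarize the trajectory of per-layer hidden states as a crude
# "chain-of-embedding" signal that could feed an output-free self-evaluation rule.
import numpy as np

def coe_signal(hidden_states: list[np.ndarray]) -> dict[str, float]:
    """hidden_states: one vector per layer for the same generation step."""
    mags, angles = [], []
    for prev, curr in zip(hidden_states, hidden_states[1:]):
        mags.append(float(np.linalg.norm(curr - prev)))           # how far the state moved
        cos = float(np.dot(prev, curr) /
                    (np.linalg.norm(prev) * np.linalg.norm(curr) + 1e-9))
        angles.append(float(np.arccos(np.clip(cos, -1.0, 1.0))))  # how much it rotated
    return {"mean_step_magnitude": float(np.mean(mags)),
            "mean_step_angle": float(np.mean(angles))}

# Example with random stand-in states (33 layers x 4096-dim hidden size):
states = [np.random.randn(4096) for _ in range(33)]
print(coe_signal(states))
```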
arXiv Detail & Related papers (2024-10-17T15:09:24Z) - Beyond ChatGPT: Enhancing Software Quality Assurance Tasks with Diverse LLMs and Validation Techniques [14.230480872339463]
This paper investigates the capabilities of several Large Language Models (LLMs) across two SQA tasks: fault localization and vulnerability detection.
By implementing a voting mechanism to combine the LLMs' results, we achieved more than a 10% improvement over GPT-3.5 in both tasks.
This approach led to performance improvements of 16% in fault localization and 12% in vulnerability detection compared to GPT-3.5, and a 4% improvement over the best-performing LLMs.
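A voting scheme of this kind can be sketched in a few lines; the model names and labels below are placeholders, and simple majority voting is assumed rather than the paper's exact combination rule.

```python
# Minimal majority-voting sketch over multiple LLMs' predictions for an SQA task.
from collections import Counter

def majority_vote(predictions: dict[str, str]) -> str:
    """predictions: model name -> predicted label (e.g., a suspicious line or 'vulnerable'/'safe')."""
    counts = Counter(predictions.values())
    label, _ = counts.most_common(1)[0]
    return label

votes = {"model_a": "vulnerable", "model_b": "vulnerable", "model_c": "safe"}
print(majority_vote(votes))  # -> "vulnerable"
```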
arXiv Detail & Related papers (2024-09-02T07:26:19Z) - LLMSecCode: Evaluating Large Language Models for Secure Coding [0.24999074238880484]
This work aims to improve the selection process of Large Language Models (LLMs) that are suitable for facilitating Secure Coding (SC).
We introduce LLMSecCode, an open-source evaluation framework designed to assess SC capabilities objectively.
arXiv Detail & Related papers (2024-08-28T19:07:08Z) - Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
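For context, the standard single-agent advantage-weighted regression objective that the paper extends to the multi-agent setting reads as follows (textbook RL form, not the paper's multi-agent variant):

```latex
% Standard single-agent advantage-weighted regression; beta is a temperature and
% A(s,a) is the advantage. The paper's multi-agent extension differs in detail.
\[
\max_{\pi}\;\mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\log \pi(a \mid s)\,\exp\!\Big(\tfrac{A(s,a)}{\beta}\Big)\Big],
\qquad A(s,a) = Q(s,a) - V(s).
\]
```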
arXiv Detail & Related papers (2024-05-23T08:33:19Z) - How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [83.78240828340681]
We introduce GAMA($\gamma$)-Bench, a new framework for evaluating Large Language Models' Gaming Ability in Multi-Agent environments.
$\gamma$-Bench includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to assess LLMs' performance.
Results indicate GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought.
arXiv Detail & Related papers (2024-03-18T14:04:47Z) - Enabling Weak LLMs to Judge Response Reliability via Meta Ranking [38.63721941742435]
We propose a novel cross-query-comparison-based method called $\textit{Meta Ranking}$ (MR).
MR assesses reliability by pairwise ranking the target query-response pair against multiple reference query-response pairs.
We show that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning.
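A heavily simplified sketch of this pairwise idea is given below, under the assumption that the target pair is judged only against reference pairs known to be correct; the real method also uses incorrect references and a more careful aggregation, and `compare` is any pairwise judge (e.g., a small LLM prompted to rank the two pairs).

```python
# Hypothetical, simplified Meta-Ranking-style reliability check.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (query, response)

def meta_rank(target: Pair,
              correct_references: List[Pair],
              compare: Callable[[Pair, Pair], int],
              threshold: float = 0.5) -> bool:
    """Deem the target reliable if it is judged at least as good (compare >= 0)
    as a majority of reference pairs that are known to be correct."""
    wins = sum(1 for ref in correct_references if compare(target, ref) >= 0)
    return wins / max(len(correct_references), 1) >= threshold
```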
arXiv Detail & Related papers (2024-02-19T13:57:55Z) - Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs).
We externalize the collective knowledge and unique strengths of the source LLMs, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z) - TrustLLM: Trustworthiness in Large Language Models [446.5640421311468]
This paper introduces TrustLLM, a comprehensive study of trustworthiness in large language models (LLMs).
We first propose a set of principles for trustworthy LLMs that span eight different dimensions.
Based on these principles, we establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics.
arXiv Detail & Related papers (2024-01-10T22:07:21Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
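As a much simpler stand-in for what such a metric measures (not the paper's MONITOR definition), one can probe the same fact with several paraphrased prompts and score how consistently the gold answer comes back:

```python
# Illustrative consistency-style reliability check; the exact-match comparison and the
# paraphrase-probing setup are assumptions, not the paper's MONITOR formula.
def factual_consistency(answers: list[str], gold: str) -> float:
    hits = sum(1 for a in answers if a.strip().lower() == gold.strip().lower())
    return hits / max(len(answers), 1)

print(factual_consistency(["Paris", "paris", "Lyon"], "Paris"))  # -> 0.666...
```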
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - LLMRec: Benchmarking Large Language Models on Recommendation Task [54.48899723591296]
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated.
We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization.
The benchmark results indicate that LLMs display only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
arXiv Detail & Related papers (2023-08-23T16:32:54Z)