Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing
- URL: http://arxiv.org/abs/2502.05782v1
- Date: Sun, 09 Feb 2025 05:53:03 GMT
- Title: Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing
- Authors: Bestoun S. Ahmed, Ludwig Otto Baader, Firas Bayram, Siri Jagstedt, Peter Magnusson
- Abstract summary: This paper presents a comprehensive framework for testing and evaluating quality characteristics of Large Language Model (LLM) systems enhanced with Retrieval-Augmented Generation (RAG).
We demonstrate the effectiveness of our testing methodology in assessing both functional correctness and extra-functional properties.
- Abstract: This paper presents a comprehensive framework for testing and evaluating quality characteristics of Large Language Model (LLM) systems enhanced with Retrieval-Augmented Generation (RAG) in tourism applications. Through systematic empirical evaluation of three different LLM variants across multiple parameter configurations, we demonstrate the effectiveness of our testing methodology in assessing both functional correctness and extra-functional properties. Our framework implements 17 distinct metrics that encompass syntactic analysis, semantic evaluation, and behavioral evaluation through LLM judges. The study reveals significant insights into how different architectural choices and parameter configurations affect system performance, particularly highlighting the impact of temperature and top-p parameters on response quality. The tests were carried out on a tourism recommendation system for the Värmland region, utilizing standard and RAG-enhanced configurations. The results indicate that the newer LLM versions show modest improvements in performance metrics, though the differences are more pronounced in response length and complexity than in semantic quality. The research contributes practical insights for implementing robust testing practices in LLM-RAG systems, providing valuable guidance to organizations deploying these architectures in production environments.
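As a rough illustration of the kind of harness the abstract describes, the sketch below sweeps temperature and top-p over a tourism prompt and records two simple proxy metrics (response length and a lexical-similarity score). The model call, metric choices, and all names are illustrative assumptions; the paper's actual 17 metrics, LLM judges, and RAG pipeline are not reproduced here.

```python
# Illustrative sketch only: the real system, metrics, and model client are not
# published in this abstract, so the names below are hypothetical placeholders.
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import product


@dataclass
class EvalResult:
    temperature: float
    top_p: float
    length_tokens: int
    lexical_similarity: float  # crude stand-in for a semantic-quality metric


def generate(prompt: str, temperature: float, top_p: float) -> str:
    """Placeholder for the LLM-RAG call (e.g., an API or local model)."""
    return f"Värmland offers forests, lakes, and the city of Karlstad. (t={temperature}, p={top_p})"


def lexical_similarity(answer: str, reference: str) -> float:
    """Toy proxy metric: character-level overlap ratio, not a real semantic score."""
    return SequenceMatcher(None, answer.lower(), reference.lower()).ratio()


def sweep(prompt: str, reference: str) -> list[EvalResult]:
    """Run the prompt across a small temperature/top-p grid and score each answer."""
    results = []
    for temperature, top_p in product([0.2, 0.7, 1.0], [0.5, 0.9, 1.0]):
        answer = generate(prompt, temperature, top_p)
        results.append(
            EvalResult(
                temperature=temperature,
                top_p=top_p,
                length_tokens=len(answer.split()),
                lexical_similarity=lexical_similarity(answer, reference),
            )
        )
    return results


if __name__ == "__main__":
    prompt = "What should a visitor see in Värmland in summer?"
    reference = "Värmland is known for its lakes, forests, and the city of Karlstad."
    for result in sweep(prompt, reference):
        print(result)
```

In a real harness, the placeholder generate call would hit the deployed LLM-RAG endpoint, the lexical proxy would be replaced by embedding-based semantic metrics and LLM-judge scoring, and each configuration would be sampled repeatedly to average out decoding randomness.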
Related papers
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.
Our benchmark is characterized by its multi-dimensional evaluation framework.
Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
- Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs [64.9693406713216]
Internal mechanisms that contribute to the effectiveness of RAG systems remain underexplored.
Our experiments reveal that several core groups of experts are primarily responsible for RAG-related behaviors.
We propose several strategies to enhance RAG's efficiency and effectiveness through expert activation.
arXiv Detail & Related papers (2024-10-20T16:08:54Z)
- LLaVA-Critic: Learning to Evaluate Multimodal Models [110.06665155812162]
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator.
LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios.
arXiv Detail & Related papers (2024-10-03T17:36:33Z)
- MILE: A Mutation Testing Framework of In-Context Learning Systems [5.419884861365132]
We propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems.
First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets.
With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites (a toy mutation-testing sketch follows this entry).
arXiv Detail & Related papers (2024-09-07T13:51:42Z)
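The entry above names mutation operators over ICL demonstrations and a mutation score for test sets; MILE's actual operators are not spelled out here, so the sketch below is only a generic, hypothetical illustration: two toy operators (dropping a demonstration, flipping a label) and a score defined as the fraction of mutants whose output diverges on some test query.

```python
# Generic illustration of mutation testing for ICL prompts; the operators and
# score below are simplified stand-ins, not the ones defined in MILE.
import random

random.seed(0)

DEMOS = [
    ("The hotel was spotless and quiet.", "positive"),
    ("The tour guide was rude and late.", "negative"),
    ("Great view over the lake.", "positive"),
]


def drop_demo(demos):
    """Mutation operator: remove one demonstration."""
    i = random.randrange(len(demos))
    return demos[:i] + demos[i + 1:]


def flip_label(demos):
    """Mutation operator: invert the label of one demonstration."""
    i = random.randrange(len(demos))
    text, label = demos[i]
    flipped = "negative" if label == "positive" else "positive"
    return demos[:i] + [(text, flipped)] + demos[i + 1:]


def classify(demos, query):
    """Placeholder ICL 'model': returns the majority label of the demonstrations."""
    labels = [label for _, label in demos]
    return max(set(labels), key=labels.count)


def mutation_score(test_queries, operators, n_mutants=10):
    """Fraction of mutants whose output differs from the original on some test query."""
    killed = 0
    for _ in range(n_mutants):
        mutant = random.choice(operators)(DEMOS)
        if any(classify(mutant, q) != classify(DEMOS, q) for q in test_queries):
            killed += 1
    return killed / n_mutants


if __name__ == "__main__":
    tests = ["The breakfast was cold.", "Lovely staff and rooms."]
    print("mutation score:", mutation_score(tests, [drop_demo, flip_label]))
```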
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques (a toy statistical comparison is sketched after this entry).
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
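The tests named in the entry above are standard statistical tools; a minimal, self-contained sketch of an ANOVA plus Tukey HSD comparison over per-model scores is shown below, using SciPy and statsmodels. The scores and model names are fabricated for the example and are not data from the paper.

```python
# Toy illustration of comparing per-model evaluation scores with ANOVA and
# Tukey HSD; the scores and model names are fabricated for the example.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)

# Hypothetical quality scores (0-1) for three model configurations.
scores = {
    "model_a": rng.normal(0.72, 0.05, 30),
    "model_b": rng.normal(0.74, 0.05, 30),
    "model_c": rng.normal(0.70, 0.05, 30),
}

# One-way ANOVA: is there any difference in mean score across models?
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.3f}")

# Tukey HSD: which pairs of models differ, controlling for multiple comparisons?
values = np.concatenate(list(scores.values()))
groups = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```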
- RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems [51.171355532527365]
Retrieval-augmented generation (RAG) can significantly improve the performance of language models (LMs).
RAGGED is a framework for analyzing RAG configurations across various document-based question answering tasks.
arXiv Detail & Related papers (2024-03-14T02:26:31Z)
- METAL: Metamorphic Testing Framework for Analyzing Large-Language Model Qualities [4.493507573183107]
Large-Language Models (LLMs) have shifted the paradigm of natural language data processing.
Recent studies have tested Quality Attributes (QAs) of LLMs by generating adversarial input texts.
We propose a MEtamorphic Testing for Analyzing LLMs (METAL) framework to address these issues (a toy metamorphic relation is sketched after this entry).
arXiv Detail & Related papers (2023-12-11T01:29:19Z)
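METAL's concrete metamorphic relations are not given in the snippet above, so the sketch below only illustrates the general pattern of a metamorphic test for an LLM quality attribute: a consistency relation asserting that a trivial input perturbation (a case change) should not flip the predicted label, evaluated against a placeholder classifier standing in for an LLM call.

```python
# Generic metamorphic-testing illustration, not METAL's actual relations: a
# simple consistency relation says that trivially perturbing the input (here,
# changing letter case) should not change the model's sentiment label.
def classify_sentiment(text: str) -> str:
    """Placeholder model: keyword-based sentiment, standing in for an LLM call."""
    negatives = {"bad", "rude", "dirty", "late"}
    words = {w.strip(".,!?") for w in text.lower().split()}
    return "negative" if words & negatives else "positive"


def perturb_case(text: str) -> str:
    """Metamorphic transformation: upper-case the text; the label should not change."""
    return text.upper()


def metamorphic_consistency(inputs: list[str]) -> float:
    """Fraction of inputs for which the relation holds (label unchanged)."""
    holds = sum(
        classify_sentiment(t) == classify_sentiment(perturb_case(t)) for t in inputs
    )
    return holds / len(inputs)


if __name__ == "__main__":
    samples = ["The guide was rude.", "Lovely lake views.", "The room was dirty."]
    print("consistency rate:", metamorphic_consistency(samples))
```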
- Fairness and underspecification in acoustic scene classification: The case for disaggregated evaluations [6.186191586944725]
Underspecification and fairness in machine learning (ML) applications have recently become two prominent issues in the ML community.
We argue for the need for a more holistic evaluation process for acoustic scene classification (ASC) models through disaggregated evaluations.
We demonstrate the effectiveness of the proposed evaluation process in uncovering underspecification and fairness problems in models trained on two widely used ASC datasets.
arXiv Detail & Related papers (2021-10-04T15:23:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.