Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human in the Loop
- URL: http://arxiv.org/abs/2407.05925v1
- Date: Mon, 8 Jul 2024 13:32:14 GMT
- Title: Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human in the Loop
- Authors: Anum Afzal, Alexander Kowsik, Rajna Fani, Florian Matthes
- Abstract summary: Large Language Models have found application in mundane and repetitive tasks including Human Resource (HR) support.
We developed an HR support chatbot as an efficient and effective tool for addressing employee inquiries.
Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data.
Through expert analysis, we infer that reference-free evaluation metrics such as G-Eval and Prometheus demonstrate reliability closely aligned with that of human evaluation.
- Score: 44.51779041553597
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large Language Models have found application in various mundane and repetitive tasks, including Human Resource (HR) support. We worked with the domain experts of SAP SE to develop an HR support chatbot as an efficient and effective tool for addressing employee inquiries. We incorporated a human in the loop at various stages of the development cycle, such as dataset collection, prompt optimization, and evaluation of the generated output. By enhancing the LLM-driven chatbot's response quality and exploring alternative retrieval methods, we have created an efficient, scalable, and flexible tool for HR professionals to address employee inquiries effectively. Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data through internal reasoning capabilities. Additionally, through expert analysis, we infer that reference-free evaluation metrics such as G-Eval and Prometheus demonstrate reliability closely aligned with that of human evaluation.
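The retrieval-augmented QA pattern the abstract outlines can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: the `embed` and `generate` callables, the `retrieve_top_k` helper, and the prompt wording are hypothetical stand-ins for whatever embedding model and chat LLM (e.g. GPT-4) a deployment actually uses.

```python
# Illustrative sketch only: a generic retrieval-augmented QA loop.
# embed() and generate() are placeholders for an embedding model and a chat LLM.
from typing import Callable, List, Tuple
import numpy as np

def retrieve_top_k(question: str,
                   documents: List[str],
                   embed: Callable[[str], np.ndarray],
                   k: int = 3) -> List[str]:
    """Rank HR policy snippets by cosine similarity to the question."""
    q_vec = embed(question)
    scored: List[Tuple[float, str]] = []
    for doc in documents:
        d_vec = embed(doc)
        sim = float(np.dot(q_vec, d_vec) /
                    (np.linalg.norm(q_vec) * np.linalg.norm(d_vec) + 1e-9))
        scored.append((sim, doc))
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:k]]

def answer_hr_question(question: str,
                       documents: List[str],
                       embed: Callable[[str], np.ndarray],
                       generate: Callable[[str], str]) -> str:
    """Ground the LLM's answer in the retrieved context (hypothetical prompt)."""
    context = "\n\n".join(retrieve_top_k(question, documents, embed))
    prompt = (
        "Answer the employee's HR question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```

The human-in-the-loop steps the abstract describes (dataset collection, prompt optimization, evaluation of generated output) would sit around a loop of this kind rather than inside it.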
Related papers
- Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations.
Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs.
We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges (a minimal sketch of this judge-aggregation pattern appears after the related-papers list).
arXiv Detail & Related papers (2024-08-17T16:01:45Z)
- BotEval: Facilitating Interactive Human Evaluation [21.99269491969255]
We develop BotEval, an easily customizable, open-source evaluation toolkit that focuses on enabling human-bot interactions as part of the evaluation process.
arXiv Detail & Related papers (2024-07-25T04:57:31Z)
- The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches [0.0]
We discuss the issues with increasingly popular LLM-based evaluations and how they correlate with human evaluations.
We introduce a comprehensive factored evaluation mechanism that can be utilized in conjunction with both human and LLM-based evaluations.
Results show that factor-based evaluation produces better insights into which aspects of LLM applications need to be improved.
arXiv Detail & Related papers (2024-06-05T14:55:10Z)
- Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models [52.368110271614285]
We introduce AdvEval, a novel black-box adversarial framework against NLG evaluators.
AdvEval is specially tailored to generate data that yield strong disagreements between human and victim evaluators.
We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation.
arXiv Detail & Related papers (2024-05-23T14:48:15Z)
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
arXiv Detail & Related papers (2024-03-17T07:34:12Z)
- Can AI Serve as a Substitute for Human Subjects in Software Engineering Research? [24.39463126056733]
This vision paper proposes a novel approach to qualitative data collection in software engineering research by harnessing the capabilities of artificial intelligence (AI).
We explore the potential of AI-generated synthetic text as an alternative source of qualitative data.
We discuss the prospective development of new foundation models aimed at emulating human behavior in observational studies and user evaluations.
arXiv Detail & Related papers (2023-11-18T14:05:52Z)
- Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose CoEval, a collaborative evaluation pipeline involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
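Several entries above (the reference-guided verdict method, CoEval, ChatEval) and the G-Eval/Prometheus metrics used in the main paper share an LLM-as-judge pattern: prompt one or more judge models with a rubric and aggregate their ratings. Below is a minimal sketch of that pattern; the `judges` callables, the rubric wording, and the 1-5 scale are illustrative assumptions rather than any single paper's protocol.

```python
# Illustrative LLM-as-judge sketch: each judge model rates a response
# against task-specific criteria and the ratings are averaged.
from statistics import mean
from typing import Callable, List
import re

def judge_response(question: str,
                   response: str,
                   judges: List[Callable[[str], str]],
                   criteria: str = "relevance, factual consistency, fluency") -> float:
    """Ask each judge LLM for a 1-5 rating and return the mean score."""
    prompt = (
        f"Rate the answer on {criteria} from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\nAnswer: {response}\n"
        "Reply with a single number."
    )
    scores = []
    for judge in judges:
        reply = judge(prompt)
        match = re.search(r"[1-5]", reply)  # tolerate judges that add extra text
        if match:
            scores.append(int(match.group()))
    return mean(scores) if scores else float("nan")
```

In practice such automatic scores would be compared against human ratings, which is how the main paper and several of the listed works assess the reliability of reference-free metrics.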