EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context
- URL: http://arxiv.org/abs/2407.04472v3
- Date: Tue, 9 Jul 2024 13:31:00 GMT
- Title: EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context
- Authors: Hannes Kunstmann, Joseph Ollier, Joel Persson, Florian von Wangenheim
- Abstract summary: Large language models (LLMs) present an enormous evolution in the strategic potential of conversational recommender systems (CRS).
Yet, research has predominantly focused upon technical frameworks to implement LLM-driven CRS, rather than end-user evaluations or strategic implications for firms.
We detail the design of an LLM-driven CRS in an SME setting, and its subsequent performance in the field using both objective system metrics and subjective user evaluations.
- Score: 0.9999629695552196
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) present an enormous evolution in the strategic potential of conversational recommender systems (CRS). Yet to date, research has predominantly focused upon technical frameworks to implement LLM-driven CRS, rather than end-user evaluations or strategic implications for firms, particularly from the perspective of the small to medium enterprises (SMEs) that make up the bedrock of the global economy. In the current paper, we detail the design of an LLM-driven CRS in an SME setting, and its subsequent performance in the field using both objective system metrics and subjective user evaluations. While doing so, we additionally outline a short-form revised ResQue model for evaluating LLM-driven CRS, enabling replicability in a rapidly evolving field. Our results reveal good system performance from a user experience perspective (85.5% recommendation accuracy) but underscore latency, cost, and quality issues challenging business viability. Notably, with a median cost of $0.04 per interaction and a latency of 5.7s, cost-effectiveness and response time emerge as crucial areas for achieving a more user-friendly and economically viable LLM-driven CRS for SME settings. One major driver of these costs is the use of an advanced LLM as a ranker within the retrieval-augmented generation (RAG) technique. Our results additionally indicate that relying solely on approaches such as prompt-based learning with ChatGPT as the underlying LLM makes it challenging to achieve satisfying quality in a production environment. Strategic considerations for SMEs deploying an LLM-driven CRS are outlined, particularly considering trade-offs in the current technical landscape.
Related papers
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models [71.8065384742686]
LMMS-EVAL is a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models.
LMMS-EVAL LITE is a pruned evaluation toolkit that emphasizes both coverage and efficiency.
Multimodal LIVEBENCH utilizes continuously updating news and online forums to assess models' generalization abilities in the wild.
arXiv Detail & Related papers (2024-07-17T17:51:53Z)
- CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines [29.25579967636023]
We introduce CEBench, an open-source toolkit for benchmarking online large language models.
It focuses on the critical trade-offs between expenditure and effectiveness required for LLM deployments.
This capability supports crucial decision-making processes aimed at maximizing effectiveness while minimizing cost impacts.
arXiv Detail & Related papers (2024-06-20T21:36:00Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- RepEval: Effective Text Evaluation with LLM Representation [54.07909112633993]
We introduce RepEval, the first metric leveraging the projection of LLM representations for evaluation.
RepEval requires minimal sample pairs for training, and through simple prompt modifications, it can easily transition to various tasks.
Results on ten datasets from three tasks demonstrate the high effectiveness of our method.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [44.401826163314716]
We propose a new evaluation paradigm for MLLMs using potent MLLM as the judge.
We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models.
The validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models [13.799197575126442]
Small and medium-sized enterprises (SMEs) cannot afford the cost of creating large task-specific training datasets.
Third-party services that allow them to prompt Large Language Models currently require a payment per call.
We propose a framework that allows reducing the calls to LLMs by caching previous responses and using them to train a local inexpensive model.
arXiv Detail & Related papers (2023-10-20T10:05:07Z)
- Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena [25.865825113847404]
We introduce AucArena, a novel evaluation suite that simulates auctions.
We conduct controlled experiments using state-of-the-art Large Language Models (LLMs) to power bidding agents to benchmark their planning and execution skills.
arXiv Detail & Related papers (2023-10-09T14:22:09Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- How Can Recommender Systems Benefit from Large Language Models: A Survey [82.06729592294322]
Large language models (LLM) have shown impressive general intelligence and human-like capabilities.
We conduct a comprehensive survey on this research direction from the perspective of the whole pipeline in real-world recommender systems.
arXiv Detail & Related papers (2023-06-09T11:31:50Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.