InCA: Rethinking In-Car Conversational System Assessment Leveraging
Large Language Models
- URL: http://arxiv.org/abs/2311.07469v2
- Date: Wed, 15 Nov 2023 22:10:34 GMT
- Title: InCA: Rethinking In-Car Conversational System Assessment Leveraging
Large Language Models
- Authors: Ken E. Friedl, Abbas Goher Khan, Soumya Ranjan Sahoo, Md Rashad Al
Hasan Rony, Jana Germies, Christian Süß
- Abstract summary: This paper introduces a set of datasets specifically designed for in-car conversational question answering (ConvQA) systems.
A preliminary and comprehensive empirical evaluation substantiates the efficacy of our proposed approach.
- Score: 2.2602594453321063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The assessment of advanced generative large language models (LLMs) poses a
significant challenge, given their heightened complexity in recent
developments. Furthermore, evaluating the performance of LLM-based applications
in various industries, as indicated by Key Performance Indicators (KPIs), is a
complex undertaking. This task necessitates a profound understanding of
industry use cases and the anticipated system behavior. Within the context of
the automotive industry, existing evaluation metrics prove inadequate for
assessing in-car conversational question answering (ConvQA) systems. The unique
demands of these systems, where answers may relate to driver or car safety and
are confined within the car domain, highlight the limitations of current
metrics. To address these challenges, this paper introduces a set of KPIs
tailored for evaluating the performance of in-car ConvQA systems, along with
datasets specifically designed for these KPIs. A preliminary and comprehensive
empirical evaluation substantiates the efficacy of our proposed approach.
Furthermore, we investigate the impact of employing varied personas in prompts
and find that this enhances the model's capacity to simulate diverse viewpoints
in assessments, mirroring how individuals with different backgrounds perceive a
topic.
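The persona-varied prompting the abstract describes can be pictured with a short sketch. The snippet below is an illustrative approximation rather than the paper's implementation: the personas, the KPI wording, and the `query_llm` callable are hypothetical placeholders for whichever judge LLM and KPI definitions are actually used.

```python
# Illustrative sketch (not the paper's code): persona-conditioned LLM-as-judge
# prompts for scoring an in-car ConvQA answer against one KPI.
# Personas, KPI wording, and query_llm are hypothetical placeholders.

from statistics import mean
from typing import Callable

PERSONAS = [
    "a cautious first-time driver unfamiliar with the vehicle",
    "an experienced driver focused on minimizing distraction",
    "a vehicle safety engineer checking factual correctness",
]

PROMPT_TEMPLATE = """You are {persona}.
Rate the assistant's answer on a 1-5 scale for the KPI "{kpi}".
Question: {question}
Answer: {answer}
Reply with a single integer."""


def persona_scores(
    question: str,
    answer: str,
    kpi: str,
    query_llm: Callable[[str], str],  # plug in any judge-LLM client here
) -> dict:
    """Query the judge LLM once per persona and aggregate the scores."""
    scores = []
    for persona in PERSONAS:
        prompt = PROMPT_TEMPLATE.format(
            persona=persona, kpi=kpi, question=question, answer=answer
        )
        reply = query_llm(prompt)
        scores.append(int(reply.strip()))
    return {"per_persona": dict(zip(PERSONAS, scores)), "mean": mean(scores)}


if __name__ == "__main__":
    # Dummy judge that always returns "4", just to show the call pattern.
    result = persona_scores(
        question="How do I disable the lane-keeping assistant?",
        answer="Press and hold the assist button on the steering wheel.",
        kpi="safety relevance within the car domain",
        query_llm=lambda prompt: "4",
    )
    print(result)
```

Passing the judge model in as a callable keeps the sketch independent of any particular LLM API; in practice each persona's score could also be returned with a short rationale for auditing.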
Related papers
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five user-specified evaluation capabilities, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs [29.72874725703848]
We introduce two concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process.
We propose an agent-based evaluation framework called TestAgent, which implements these concepts through retrieval augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z) - Centralization potential of automotive E/E architectures [2.7143159361691227]
A centralized E/E architecture is often seen as a key enabler for mastering these challenges.
However, there is a research gap concerning guidelines that help system designers and function developers analyze the centralization potential of their systems.
This paper bridges the gap between theoretical research and practical application, offering valuable takeaways for practitioners.
arXiv Detail & Related papers (2024-09-16T19:36:32Z) - Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs).
We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
arXiv Detail & Related papers (2024-09-16T09:06:44Z) - Towards Flexible Evaluation for Generative Visual Question Answering [17.271448204525612]
This paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on Visual Question Answering (VQA) datasets.
In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE) with meticulous design based on the unique features of VQA evaluation.
arXiv Detail & Related papers (2024-08-01T05:56:34Z) - KaPQA: Knowledge-Augmented Product Question-Answering [59.096607961704656]
We introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products.
We also propose a novel knowledge-driven RAG-QA framework to enhance the performance of the models in the product QA task.
arXiv Detail & Related papers (2024-07-22T22:14:56Z) - Evaluating General-Purpose AI with Psychometrics [43.85432514910491]
We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
arXiv Detail & Related papers (2023-10-25T05:38:38Z) - Rethinking Word-Level Auto-Completion in Computer-Aided Translation [76.34184928621477]
Word-Level Auto-Completion (WLAC) plays a crucial role in Computer-Assisted Translation.
It aims at providing word-level auto-completion suggestions for human translators.
We introduce a measurable criterion for what constitutes a good auto-completion and discover that existing WLAC models often fail to meet this criterion.
We propose an effective approach to enhance WLAC performance by promoting adherence to the criterion.
arXiv Detail & Related papers (2023-10-23T03:11:46Z) - Overview of Robust and Multilingual Automatic Evaluation Metrics for
Open-Domain Dialogue Systems at DSTC 11 Track 4 [51.142614461563184]
This track in the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics.
This article describes the datasets and baselines provided to participants and discusses the submission and result details of the two proposed subtasks.
arXiv Detail & Related papers (2023-06-22T10:50:23Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
When asked, large language models (LLMs) claim that they can assist with relevance judgments.
However, it is not clear whether such automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z) - Evaluation Gaps in Machine Learning Practice [13.963766987258161]
In practice, evaluations of machine learning models frequently focus on a narrow range of decontextualized predictive behaviours.
We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations.
By studying these properties, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts.
arXiv Detail & Related papers (2022-05-11T04:00:44Z)