Evaluation of Synthetic Datasets for Conversational Recommender Systems
- URL: http://arxiv.org/abs/2212.08167v1
- Date: Mon, 12 Dec 2022 18:53:10 GMT
- Title: Evaluation of Synthetic Datasets for Conversational Recommender Systems
- Authors: Harsh Lara, Manoj Tiwari
- Abstract summary: The absence of robust evaluation frameworks has been a long-standing problem.
Since the quality of training data is critical for downstream applications, it is important to develop metrics that evaluate the quality holistically.
In this paper, we present a framework that takes a multi-faceted approach towards evaluating datasets produced by generative models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For researchers leveraging Large-Language Models (LLMs) in the generation of
training datasets, especially for conversational recommender systems - the
absence of robust evaluation frameworks has been a long-standing problem. The
efficiency brought about by LLMs in the data generation phase is impeded during
the process of evaluation of the generated data, since it generally requires
human-raters to ensure that the data generated is of high quality and has
sufficient diversity. Since the quality of training data is critical for
downstream applications, it is important to develop metrics that evaluate the
quality holistically and identify biases. In this paper, we present a framework
that takes a multi-faceted approach towards evaluating datasets produced by
generative models and discuss the advantages and limitations of various
evaluation methods.
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Unveiling Context-Aware Criteria in Self-Assessing LLMs [28.156979106994537]
We propose a novel Self-Assessing LLM framework that integrates Context-Aware Criteria (SALC) with dynamic knowledge tailored to each evaluation instance.
Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks.
Our method also exhibits an improvement of up to 12% in LC Win-Rate on the AlpacaEval2 leaderboard when employed for preference data generation.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs [11.24476329991465]
Training large language models (LLMs) for external tool usage is a rapidly expanding field.
The absence of systematic data quality checks poses complications for properly training and testing models.
We propose two approaches for assessing the reliability of data for training LLMs to use external tools.
arXiv Detail & Related papers (2024-09-24T17:20:02Z)
- Progressively Label Enhancement for Large Language Model Alignment [42.01694160556464]
Large Language Model (LLM) alignment aims to prevent models from producing content that misaligns with human expectations.
We propose PLE, a framework that dynamically adjusts the model's training process based on the evolving quality of the generated data.
arXiv Detail & Related papers (2024-08-05T16:21:17Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- Dataset Regeneration for Sequential Recommendation [69.93516846106701]
We propose a data-centric paradigm for developing an ideal training dataset using a model-agnostic dataset regeneration framework called DR4SR.
To demonstrate the effectiveness of the data-centric paradigm, we integrate our framework with various model-centric methods and observe significant performance improvements across four widely adopted datasets.
arXiv Detail & Related papers (2024-05-28T03:45:34Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets [56.052803235932686]
We propose a novel automatic dataset synthesis approach that can generate both large-scale and high-quality recommendation dialogues.
In doing so, we exploit: (i) rich personalized user profiles from traditional recommendation datasets, (ii) rich external knowledge from knowledge graphs, and (iii) the conversation ability contained in human-to-human conversational recommendation datasets.
arXiv Detail & Related papers (2023-06-16T05:27:14Z)
- GMValuator: Similarity-based Data Valuation for Generative Models [41.76259565672285]
We introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to provide data valuation for generation tasks.
GMValuator is extensively evaluated on various datasets and generative architectures to demonstrate its effectiveness.
arXiv Detail & Related papers (2023-04-21T02:02:02Z)
- Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several different datasets, using a variety of state-of-the-art benchmarks, demonstrates how our approach can achieve significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z)