Related papers: Leveraging LLMs for Dialogue Quality Measurement

Leveraging LLMs for Dialogue Quality Measurement

URL: http://arxiv.org/abs/2406.17304v1
Date: Tue, 25 Jun 2024 06:19:47 GMT
Title: Leveraging LLMs for Dialogue Quality Measurement
Authors: Jinghan Jia, Abi Komma, Timothy Leffel, Xujun Peng, Ajay Nagesh, Tamer Soliman, Aram Galstyan, Anoop Kumar,
Abstract summary: Large language models (LLMs) show robust zeroshot and few-shot capabilities across NLP tasks. Manipulating factors such as model size, in-context examples, and selection techniques, we examine "chain-of-thought" (CoT) reasoning and label extraction procedures. Our results indicate that LLMs that are suitably fine-tuned and have sufficient reasoning capabilities can be leveraged for automated dialogue evaluation.
Score: 27.046917937460798
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In task-oriented conversational AI evaluation, unsupervised methods poorly correlate with human judgments, and supervised approaches lack generalization. Recent advances in large language models (LLMs) show robust zeroshot and few-shot capabilities across NLP tasks. This paper explores using LLMs for automated dialogue quality evaluation, experimenting with various configurations on public and proprietary datasets. Manipulating factors such as model size, in-context examples, and selection techniques, we examine "chain-of-thought" (CoT) reasoning and label extraction procedures. Our results show that (1) larger models yield more accurate dialogue labels; (2) algorithmic selection of in-context examples outperforms random selection; (3) CoT reasoning where an LLM is asked to provide justifications before outputting final labels improves performance; and (4) fine-tuned LLMs outperform out-of-the-box ones. Our results indicate that LLMs that are suitably fine-tuned and have sufficient reasoning capabilities can be leveraged for automated dialogue evaluation.

Related papers

LLMs as Data Annotators: How Close Are We to Human Performance [47.61698665650761]
Manual annotation of data is labor-intensive, time-consuming, and costly. In-context learning (ICL) in which some examples related to the task are given in the prompt can lead to inefficiencies and suboptimal model performance. This paper presents experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task.
arXiv Detail & Related papers (2025-04-21T11:11:07Z)
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations. Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors. We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition. MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs. The pairwise comparison results are then aggregated into a global ranking using the Elo rating system.
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
PiCO: Peer Review in LLMs based on the Consistency Optimization [19.130941716491716]
We use peer-review mechanisms to measure large language models (LLMs) automatically. We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores. We propose three metrics called PEN, CIN, and LIS to evaluate the gap in aligning human rankings.
arXiv Detail & Related papers (2024-02-02T18:49:26Z)
A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges. We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels. We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models [28.441725610692714]
We propose a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs) We design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call. We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods.
arXiv Detail & Related papers (2023-05-23T05:57:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.