ClarQ-LLM: A Benchmark for Models Clarifying and Requesting Information in Task-Oriented Dialog
- URL: http://arxiv.org/abs/2409.06097v2
- Date: Sat, 14 Sep 2024 20:55:13 GMT
- Title: ClarQ-LLM: A Benchmark for Models Clarifying and Requesting Information in Task-Oriented Dialog
- Authors: Yujian Gan, Changling Li, Jinxia Xie, Luou Wen, Matthew Purver, Massimo Poesio
- Abstract summary: We introduce ClarQ-LLM, an evaluation framework consisting of bilingual English-Chinese conversation tasks, conversational agents and evaluation metrics.
The benchmark includes 31 different task types, each with 10 unique dialogue scenarios between information seeker and provider agents.
Unlike traditional benchmarks that evaluate agents based on fixed dialogue content, ClarQ-LLM includes a provider conversational agent to replicate the original human provider.
- Score: 11.585398152713505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ClarQ-LLM, an evaluation framework consisting of bilingual English-Chinese conversation tasks, conversational agents and evaluation metrics, designed to serve as a strong benchmark for assessing agents' ability to ask clarification questions in task-oriented dialogues. The benchmark includes 31 different task types, each with 10 unique dialogue scenarios between information seeker and provider agents. The scenarios require the seeker to ask questions to resolve uncertainty and gather necessary information to complete tasks. Unlike traditional benchmarks that evaluate agents based on fixed dialogue content, ClarQ-LLM includes a provider conversational agent to replicate the original human provider in the benchmark. This allows both current and future seeker agents to test their ability to complete information gathering tasks through dialogue by directly interacting with our provider agent. In tests, the LLAMA3.1 405B seeker agent managed a maximum success rate of only 60.05%, showing that ClarQ-LLM presents a strong challenge for future research.
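As a rough picture of how such an interactive benchmark can be scored, here is a minimal Python sketch of a seeker-provider evaluation loop with a success-rate metric. The Scenario fields, the seeker.ask / provider.respond interfaces, and the completion check are hypothetical stand-ins for illustration; ClarQ-LLM's actual API and metrics may differ.

```python
# Minimal sketch of a seeker-provider evaluation loop in the spirit of
# ClarQ-LLM. All class, field, and method names here are hypothetical
# stand-ins for illustration; the benchmark's real interfaces may differ.

from dataclasses import dataclass, field


@dataclass
class Scenario:
    task: str                    # task description shown to the seeker
    facts: dict[str, str]        # hidden information held by the provider
    required: set[str] = field(default_factory=set)  # keys needed to finish


def run_episode(seeker, provider, scenario: Scenario, max_turns: int = 10) -> bool:
    """Let the seeker ask clarification questions until it has gathered
    every required fact or the turn budget runs out."""
    gathered: dict[str, str] = {}
    for _ in range(max_turns):
        question = seeker.ask(scenario.task, gathered)    # e.g., an LLM call
        _reply, revealed = provider.respond(question)     # provider agent's answer
        gathered.update(revealed)
        if scenario.required <= gathered.keys():          # all needed info found
            return True
    return False


def success_rate(seeker, provider, scenarios: list[Scenario]) -> float:
    """Fraction of scenarios the seeker completes within the turn budget."""
    wins = sum(run_episode(seeker, provider, s) for s in scenarios)
    return wins / len(scenarios)
```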
Related papers
- Redefining Proactivity for Information Seeking Dialogue [8.986976693850869]
Information-Seeking Dialogue (ISD) agents aim to provide accurate responses to user queries.
We present a new definition of proactivity that focuses on enhancing the 'proactiveness' of each generated response.
We construct a proactive dialogue dataset comprising 2,000 single-turn conversations, and introduce several automatic metrics to evaluate response 'proactiveness'.
arXiv Detail & Related papers (2024-10-20T05:57:10Z)
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions [68.81939215223818]
ProductAgent is a conversational information-seeking agent equipped with strategic clarification question generation and dynamic product retrieval abilities.
We develop the agent with strategies for product feature summarization, query generation, and product retrieval.
Experiments show that ProductAgent interacts positively with the user and enhances retrieval performance with increasing dialogue turns.
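As an illustrative sketch of the clarify-then-retrieve loop a system like ProductAgent describes, the Python snippet below asks a clarification question, folds the answer into the known product features, and re-runs retrieval with an updated query. The llm and index objects and every method name are hypothetical stand-ins, not the paper's actual interfaces.

```python
# Illustrative sketch of the clarify-then-retrieve loop that a system like
# ProductAgent describes. The llm and index objects and every method name
# are hypothetical stand-ins, not the paper's actual interfaces.

def clarify_and_retrieve(llm, index, request: str, max_turns: int = 5):
    """Alternate clarification questions with retrieval until the results
    satisfy the agent or the turn budget is exhausted."""
    features: dict[str, str] = {}       # product attributes learned so far
    results = index.search(request)     # initial retrieval on the raw request
    for _ in range(max_turns):
        question = llm.generate_clarification(request, features, results)
        answer = input(question + " ")  # stand-in for the real shopper
        features.update(llm.parse_answer(question, answer))
        query = llm.rewrite_query(request, features)
        results = index.search(query)   # retrieval sharpens as turns accrue
        if llm.is_satisfied(results, features):
            break
    return results
```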
arXiv Detail & Related papers (2024-07-01T03:50:23Z)
- TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants [10.511277428023305]
The extended TREC Interactive Knowledge Assistance Track (iKAT) collection aims to enable researchers to test and evaluate Conversational Search Agents (CSA).
The collection contains a set of 36 personalized dialogues over 20 different topics, each coupled with a Personal Text Knowledge Base (PTKB) that defines the bespoke user personas.
A total of 344 turns with approximately 26,000 passages are provided with relevance assessments, as well as additional assessments of generated responses on four key dimensions: relevance, completeness, groundedness, and naturalness.
arXiv Detail & Related papers (2024-05-04T11:22:16Z)
- Dialogue Agents 101: A Beginner's Guide to Critical Ingredients for Designing Effective Conversational Systems [29.394466123216258]
This study provides a comprehensive overview of the primary characteristics of dialogue agents, their corresponding open-domain datasets, and the methods used to benchmark these datasets.
We propose UNIT, a UNified dIalogue dataseT constructed from conversations of existing datasets for different dialogue tasks, capturing the nuances of each of them.
arXiv Detail & Related papers (2023-07-14T10:05:47Z)
- DialogQAE: N-to-N Question Answer Pair Extraction from Customer Service Chatlog [34.69426306212259]
We propose an N-to-N QA extraction task in which the derived questions and corresponding answers might be separated across different utterances.
We introduce a suite of generative/discriminative tagging-based methods with end-to-end and two-stage variants that perform well on 5 customer service datasets.
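To make the N-to-N setup concrete, here is a toy Python sketch that treats extraction as utterance tagging and then groups matching tags into QA pairs. The "Q1"/"A1"/"O" tag scheme and the grouping step are an illustrative reading of the task, not the paper's exact method.

```python
# Toy sketch of framing N-to-N QA-pair extraction as utterance tagging:
# each utterance receives a tag such as "Q1", "A1", or "O", and pairs are
# recovered by grouping matching indices. The tag scheme and grouping are
# an illustrative reading of the task setup, not the paper's exact method.

from collections import defaultdict


def pairs_from_tags(utterances: list[str], tags: list[str]):
    """Group tagged utterances into (questions, answers) pairs by index."""
    groups: dict = defaultdict(lambda: {"Q": [], "A": []})
    for utt, tag in zip(utterances, tags):
        if tag == "O":
            continue                  # utterance belongs to no QA pair
        role, idx = tag[0], tag[1:]   # e.g. "A2" -> ("A", "2")
        groups[idx][role].append(utt)
    return [(g["Q"], g["A"]) for g in groups.values()]


chatlog = [
    "How do I reset my password?",
    "Also, can I change my email address?",
    "For the password, use the 'Forgot password' link.",
    "Email changes have to go through support.",
]
tags = ["Q1", "Q2", "A1", "A2"]
print(pairs_from_tags(chatlog, tags))
# -> pairs Q1/A1 and Q2/A2, even though each question's answer arrives
#    two utterances later
```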
arXiv Detail & Related papers (2022-12-14T09:05:14Z)
- Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue [92.01165203498299]
Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange.
This paper argues that imitation learning (IL) and related low-level metrics are actually misleading and do not align with the goals of embodied dialogue research.
arXiv Detail & Related papers (2022-10-10T05:51:40Z)
- End-to-end Spoken Conversational Question Answering: Task, Dataset and Model [92.18621726802726]
In spoken question answering, systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming to enable systems to model complex dialogue flows.
Our main objective is to build a system that deals with conversational questions based on audio recordings, and to explore the plausibility of providing additional cues from different modalities to aid information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z)
- Towards Data Distillation for End-to-end Spoken Conversational Question Answering [65.124088336738]
We propose a new Spoken Conversational Question Answering task (SCQA).
SCQA aims at enabling QA systems to model complex dialogue flows given the speech utterances and text corpora.
Our main objective is to build a QA system to deal with conversational questions both in spoken and text forms.
arXiv Detail & Related papers (2020-10-18T05:53:39Z)
- Multi-Stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting [56.268862325167575]
We tackle conversational passage retrieval (ConvPR) with query reformulation integrated into a multi-stage ad-hoc IR system.
We propose two conversational query reformulation (CQR) methods: (1) term importance estimation and (2) neural query rewriting.
For the former, we expand conversational queries using important terms extracted from the conversational context with frequency-based signals.
For the latter, we reformulate conversational queries into natural, standalone, human-understandable queries with a pretrained sequence-to-sequence model.
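As a concrete illustration of the first method, here is a minimal, self-contained Python sketch of frequency-based query expansion: count terms across the conversational context and append the most frequent unseen ones to the current query. The stopword list and scoring here are simplifications of my own, not the paper's exact approach.

```python
# Minimal, self-contained illustration of frequency-based term-importance
# expansion for conversational queries: count terms across the dialogue
# context and append the most frequent unseen ones to the current query.
# The stopword list and scoring are simplifications, not the paper's method.

from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "is", "it", "of", "to", "and", "what", "how"}


def tokenize(text: str) -> list[str]:
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def expand_query(context_turns: list[str], query: str, k: int = 2) -> str:
    """Append the k most frequent context terms not already in the query."""
    counts = Counter(tok for turn in context_turns for tok in tokenize(turn))
    present = set(tokenize(query))
    extra = [t for t, _ in counts.most_common() if t not in present][:k]
    return query + " " + " ".join(extra)


turns = [
    "Tell me about the Mars rover Perseverance.",
    "When did Perseverance land on Mars?",
]
print(expand_query(turns, "What instruments does it carry?"))
# -> "What instruments does it carry? mars perseverance"
```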
arXiv Detail & Related papers (2020-05-05T14:30:20Z)