Reranking Overgenerated Responses for End-to-End Task-Oriented Dialogue Systems
- URL: http://arxiv.org/abs/2211.03648v1
- Date: Mon, 7 Nov 2022 15:59:49 GMT
- Title: Reranking Overgenerated Responses for End-to-End Task-Oriented Dialogue Systems
- Authors: Songbo Hu, Ivan Vulić, Fangyu Liu, Anna Korhonen
- Abstract summary: End-to-end (E2E) task-oriented dialogue (ToD) systems are prone to fall into the so-called 'likelihood trap'.
We propose a reranking method which aims to select high-quality items from the lists of responses initially overgenerated by the system.
Our methods improve a state-of-the-art E2E ToD system by 2.4 BLEU, 3.2 ROUGE, and 2.8 METEOR scores, achieving new peak results.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) task-oriented dialogue (ToD) systems are prone to fall into
the so-called 'likelihood trap', resulting in generated responses which are
dull, repetitive, and often inconsistent with dialogue history. Comparing
ranked lists of multiple generated responses against the 'gold response' (from
training data) reveals a wide diversity in response quality, with many good
responses placed lower in the ranked list. The main challenge, addressed in
this work, is then how to reach beyond greedily generated system responses,
that is, how to obtain and select such high-quality responses from the list of
overgenerated responses at inference without availability of the gold response.
To this end, we propose a simple yet effective reranking method which aims to
select high-quality items from the lists of responses initially overgenerated
by the system. The idea is to use any sequence-level (similarity) scoring
function to divide the semantic space of responses into high-scoring versus
low-scoring partitions. At training, the high-scoring partition comprises all
generated responses whose similarity to the gold response is higher than the
similarity of the greedy response to the gold response. At inference, the aim
is to estimate the probability that each overgenerated response belongs to the
high-scoring partition, given only previous dialogue history. We validate the
robustness and versatility of our proposed method on the standard MultiWOZ
dataset: our methods improve a state-of-the-art E2E ToD system by 2.4 BLEU, 3.2
ROUGE, and 2.8 METEOR scores, achieving new peak results. Additional
experiments on the BiTOD dataset and human evaluation further ascertain the
generalisability and effectiveness of the proposed framework.
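The training-time partitioning step described in the abstract can be sketched as follows. This is a minimal illustration with invented example strings: the paper allows any sequence-level scoring function (e.g. BLEU), and the Jaccard token-overlap similarity used here is only a simple stand-in.

```python
# Sketch of the partitioning idea: a generated response is "high-scoring"
# if it is closer to the gold response than the greedy response is.
# The similarity function is an illustrative stand-in, not the paper's choice.

def similarity(a: str, b: str) -> float:
    """Token-overlap (Jaccard) similarity between two responses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def partition_responses(gold: str, greedy: str, overgenerated: list[str]):
    """Split overgenerated responses into high- and low-scoring partitions,
    using the greedy response's similarity to the gold response as threshold."""
    threshold = similarity(greedy, gold)
    high = [r for r in overgenerated if similarity(r, gold) > threshold]
    low = [r for r in overgenerated if similarity(r, gold) <= threshold]
    return high, low

gold = "the hotel is in the north and has free parking"
greedy = "the hotel has parking"
candidates = [
    "the hotel is in the north with free parking",
    "i do not know",
]
high, low = partition_responses(gold, greedy, candidates)
```

At inference, where no gold response is available, a learned classifier would estimate the probability that each candidate falls into the high-scoring partition given the dialogue history; the partition labels produced above serve as its training signal.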
Related papers
- Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage [74.70255719194819]
We introduce a novel framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question.
We use this framework to evaluate three commercial generative answer engines: You.com, Perplexity AI, and Bing Chat.
We find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions.
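The sub-question coverage idea can be sketched as a simple fraction: how many of a question's sub-questions does a response address? The keyword-containment check used to decide "addressed" here is a simplifying assumption, and the example data is invented.

```python
# Illustrative sketch: score a response by the fraction of sub-questions
# whose key terms all appear in it. Real coverage judgments would use a
# stronger matcher than keyword containment.

def coverage(sub_questions: list[set[str]], answer: str) -> float:
    """Fraction of sub-questions whose key terms all appear in the answer."""
    tokens = set(answer.lower().split())
    hit = sum(1 for keys in sub_questions if keys <= tokens)
    return hit / len(sub_questions) if sub_questions else 0.0

subs = [{"price"}, {"battery", "life"}, {"warranty"}]
score = coverage(subs, "the price is low and battery life is long")  # 2/3
```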
arXiv Detail & Related papers (2024-10-20T22:59:34Z)
- Towards Reliable and Factual Response Generation: Detecting Unanswerable Questions in Information-Seeking Conversations [16.99952884041096]
Generative AI models face the challenge of hallucinations that can undermine users' trust in such systems.
We approach the problem of conversational information seeking as a two-step process, where relevant passages in a corpus are identified first and then summarized into a final system response.
Specifically, our proposed method employs a sentence-level classifier to detect if the answer is present, then aggregates these predictions on the passage level, and eventually across the top-ranked passages to arrive at a final answerability estimate.
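The aggregation pipeline described above (sentence-level scores, then passage-level, then a final estimate across top-ranked passages) can be sketched as follows. The specific aggregation functions (max over sentences, mean over passages) and the numbers are illustrative assumptions, not the paper's exact choices.

```python
# Hypothetical sketch of answerability aggregation:
# per-sentence classifier scores -> passage score -> final estimate.

def passage_score(sentence_scores: list[float]) -> float:
    """A passage looks answerable if any sentence seems to contain the answer."""
    return max(sentence_scores, default=0.0)

def answerability(ranked_passages: list[list[float]]) -> float:
    """Aggregate passage-level scores across the top-ranked passages."""
    scores = [passage_score(s) for s in ranked_passages]
    return sum(scores) / len(scores) if scores else 0.0

# Two retrieved passages, with per-sentence classifier probabilities:
est = answerability([[0.1, 0.9, 0.2], [0.3, 0.4]])  # (0.9 + 0.4) / 2 = 0.65
```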
arXiv Detail & Related papers (2024-01-21T10:15:36Z)
- PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems [59.1250765143521]
Current knowledge-grounded dialogue systems often fail to align the generated responses with human-preferred qualities.
We propose Polished & Informed Candidate Scoring (PICK), a generation re-scoring framework.
We demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history.
arXiv Detail & Related papers (2023-09-19T08:27:09Z)
- RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue [37.82954848948347]
We propose the Reference-Assisted Dialogue Evaluation (RADE) approach under the multi-task learning framework.
RADE explicitly compares the reference and the candidate response to predict their overall scores.
Experiments on our three datasets and two existing benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-09-15T04:47:19Z)
- A Systematic Evaluation of Response Selection for Open Domain Dialogue [36.88551817451512]
We curated a dataset where responses produced by multiple response generators for the same dialogue context are manually annotated as appropriate (positive) or inappropriate (negative).
We conduct a systematic evaluation of state-of-the-art response selection methods and demonstrate that using multiple positive candidates and using manually verified hard negative candidates each bring significant performance improvements over adversarial training data, e.g., increases of 3% and 13% in Recall@1, respectively.
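The Recall@1 figures quoted above measure how often the model's top-ranked candidate is among the annotated positive responses. A minimal sketch of the metric, with invented data:

```python
# Minimal sketch of Recall@1 for response selection: the fraction of dialogue
# contexts where the top-ranked candidate is an annotated positive.

def recall_at_1(ranked: list[list[str]], positives: list[set[str]]) -> float:
    """ranked[i] is the model's ranking of candidates for context i;
    positives[i] is the set of responses annotated as appropriate."""
    hits = sum(
        1 for cands, pos in zip(ranked, positives) if cands and cands[0] in pos
    )
    return hits / len(ranked) if ranked else 0.0

ranked = [["sure, 7pm works", "no"], ["no", "the train leaves at 9"]]
positives = [{"sure, 7pm works"}, {"the train leaves at 9"}]
score = recall_at_1(ranked, positives)  # only the first top candidate hits: 0.5
```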
arXiv Detail & Related papers (2022-08-08T19:33:30Z)
- Generate, Evaluate, and Select: A Dialogue System with a Response Evaluator for Diversity-Aware Response Generation [9.247397520986999]
We aim to overcome the lack of diversity in responses of current dialogue systems.
We propose a generator-evaluator model that evaluates multiple responses generated by a response generator.
We conduct human evaluations to compare the output of the proposed system with that of a baseline system.
arXiv Detail & Related papers (2022-06-10T08:22:22Z)
- Double Retrieval and Ranking for Accurate Question Answering [120.69820139008138]
We show that an answer verification step introduced in Transformer-based answer selection models can significantly improve the state of the art in Question Answering.
The results on three well-known datasets for AS2 show consistent and significant improvement of the state of the art.
arXiv Detail & Related papers (2022-01-16T06:20:07Z)
- Diversifying Task-oriented Dialogue Response Generation with Prototype Guided Paraphrasing [52.71007876803418]
Existing methods for Dialogue Response Generation (DRG) in Task-oriented Dialogue Systems (TDSs) can be grouped into two categories: template-based and corpus-based.
We propose a prototype-based paraphrasing neural network, called P2-Net, which aims to enhance the quality of responses in terms of both precision and diversity.
arXiv Detail & Related papers (2020-08-07T22:25:36Z)
- Multi-Stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting [56.268862325167575]
We tackle conversational passage retrieval (ConvPR) with query reformulation integrated into a multi-stage ad-hoc IR system.
We propose two conversational query reformulation (CQR) methods: (1) term importance estimation and (2) neural query rewriting.
For the former, we expand conversational queries using important terms extracted from the conversational context with frequency-based signals.
For the latter, we reformulate conversational queries into natural, standalone, human-understandable queries with a pretrained sequence-to-sequence model.
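The first method, frequency-based term importance estimation, can be sketched as follows. The scoring (raw term frequency over the dialogue history, minus stopwords and terms already in the query) and the example dialogue are illustrative assumptions.

```python
# Minimal sketch of frequency-based term importance for conversational
# query expansion: append the most frequent context terms to the query.
from collections import Counter

STOPWORDS = {"the", "a", "is", "it", "what", "about", "and", "of", "on", "in"}

def expand_query(query: str, history: list[str], k: int = 2) -> str:
    """Append the k most frequent non-stopword context terms to the query."""
    query_tokens = set(query.lower().split())
    counts = Counter(
        tok for turn in history for tok in turn.lower().split()
        if tok not in STOPWORDS and tok not in query_tokens
    )
    important = [term for term, _ in counts.most_common(k)]
    return " ".join([query] + important)

history = [
    "tell me about the mars rover",
    "the rover landed on mars in 2021",
]
expanded = expand_query("what about its cameras", history)
# e.g. "what about its cameras mars rover"
```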
arXiv Detail & Related papers (2020-05-05T14:30:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.