Pointwise Mutual Information Based Metric and Decoding Strategy for
Faithful Generation in Document Grounded Dialogs
- URL: http://arxiv.org/abs/2305.12191v2
- Date: Fri, 1 Dec 2023 10:37:56 GMT
- Title: Pointwise Mutual Information Based Metric and Decoding Strategy for
Faithful Generation in Document Grounded Dialogs
- Authors: Yatin Nandwani and Vineet Kumar and Dinesh Raghu and Sachindra Joshi
and Luis A. Lastras
- Abstract summary: Existing metrics measure the degree of similarity between the generated response and the document's content.
We propose a new metric that utilizes (Conditional) Point-wise Mutual Information (PMI) between the generated response and the source document.
PMI quantifies the extent to which the document influences the generated response.
We build upon this idea to create a new decoding technique that incorporates PMI into the response generation process to predict more faithful responses.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A major concern in using deep learning based generative models for
document-grounded dialogs is the potential generation of responses that are not
\textit{faithful} to the underlying document. Existing automated metrics used
for evaluating the faithfulness of a response with respect to the grounding
document measure the degree of similarity between the generated response and
the document's content. However, these automated metrics are far from being
well aligned with human judgments. Therefore, to improve the measurement of
faithfulness, we propose a new metric that utilizes (Conditional) Point-wise
Mutual Information (PMI) between the generated response and the source
document, conditioned on the dialogue. PMI quantifies the extent to which the
document influences the generated response -- with a higher PMI indicating a
more faithful response. We build upon this idea to create a new decoding
technique that incorporates PMI into the response generation process to predict
more faithful responses. Our experiments on the BEGIN benchmark demonstrate an
improved correlation of our metric with human evaluation. We also show that our
decoding technique is effective in generating more faithful responses when
compared to standard decoding techniques on a set of publicly available
document-grounded dialog datasets.
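Since the page gives no implementation details beyond the abstract, the following is a minimal sketch of the two ideas, assuming a HuggingFace seq2seq model ("google/flan-t5-small"), ad-hoc prompt formats, and a token-level weight `alpha`; all of these are illustrative choices, not the paper's actual setup. `conditional_pmi` scores a finished response as log p(response | document, dialogue) minus log p(response | dialogue), and `pmi_decode` folds the same quantity into greedy decoding at the token level.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Any seq2seq LM works for this sketch; flan-t5-small keeps it lightweight.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
model.eval()

def log_likelihood(context: str, response: str) -> float:
    """Sum of token log-probabilities of `response` given `context`."""
    enc = tokenizer(context, return_tensors="pt", truncation=True)
    labels = tokenizer(response, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**enc, labels=labels)
    # out.loss is the mean negative log-likelihood per label token,
    # so multiplying by the label count recovers the summed log-prob.
    return -out.loss.item() * labels.size(1)

def conditional_pmi(document: str, dialogue: str, response: str) -> float:
    """PMI(response; document | dialogue) =
       log p(response | document, dialogue) - log p(response | dialogue).
    Higher values mean the document influenced the response more."""
    grounded = log_likelihood(f"document: {document} dialogue: {dialogue}", response)
    ungrounded = log_likelihood(f"dialogue: {dialogue}", response)
    return grounded - ungrounded

def pmi_decode(document: str, dialogue: str, alpha: float = 0.5,
               max_new_tokens: int = 64) -> str:
    """Greedy decoding where each candidate token's grounded log-probability
    is boosted by its token-level PMI with the document (weight `alpha`).
    This is one plausible reading of the abstract, not the exact method."""
    grounded = tokenizer(f"document: {document} dialogue: {dialogue}",
                         return_tensors="pt", truncation=True)
    ungrounded = tokenizer(f"dialogue: {dialogue}",
                           return_tensors="pt", truncation=True)
    dec = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(max_new_tokens):
        with torch.no_grad():
            lp_g = model(**grounded, decoder_input_ids=dec).logits[:, -1].log_softmax(-1)
            lp_u = model(**ungrounded, decoder_input_ids=dec).logits[:, -1].log_softmax(-1)
        next_id = (lp_g + alpha * (lp_g - lp_u)).argmax(dim=-1, keepdim=True)
        if next_id.item() == tokenizer.eos_token_id:
            break
        dec = torch.cat([dec, next_id], dim=-1)
    return tokenizer.decode(dec[0], skip_special_tokens=True)
```

Under this reading, a response copied from the document gets a large positive conditional PMI while a generic reply ("I'm not sure") scores near zero, matching the intuition in the abstract; in the decoder, `alpha` trades fluency against faithfulness, with `alpha = 0` recovering ordinary greedy decoding.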
Related papers
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems
We propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses.
Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgements.
arXiv Detail & Related papers (2024-06-25T06:08:16Z)
- Unlocking Structure Measuring: Introducing PDD, an Automatic Metric for Positional Discourse Coherence
We present a novel metric designed to quantify the discourse divergence between two long-form articles.
Our metric aligns more closely with human preferences and GPT-4 coherence evaluation, outperforming existing evaluation methods.
arXiv Detail & Related papers (2024-02-15T18:23:39Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization
We investigate how different types of metrics complement one another in capturing the quality of a generated summary.
We present a new metric, based on contrastive learning, to capture an aspect of quality that existing metrics miss.
arXiv Detail & Related papers (2023-12-24T13:12:39Z)
- PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems
Current knowledge-grounded dialogue systems often fail to align the generated responses with human-preferred qualities.
We propose Polished & Informed Candidate Scoring (PICK), a generation re-scoring framework.
We demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history.
arXiv Detail & Related papers (2023-09-19T08:27:09Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- Using Textual Interface to Align External Knowledge for End-to-End Task-Oriented Dialogue Systems
We propose a novel paradigm that uses a textual interface to align external knowledge and eliminate redundant processes.
We demonstrate our paradigm in practice through MultiWOZ-Remake, including an interactive textual interface built for the MultiWOZ database.
arXiv Detail & Related papers (2023-05-23T05:48:21Z)
- TRUE: Re-evaluating Factual Consistency Evaluation
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- Context Matters in Semantically Controlled Language Generation for Task-oriented Dialogue Systems
This work combines information about the dialogue history, encoded by a pre-trained model, with a meaning representation of the current system utterance to realize contextual language generation in task-oriented dialogues.
We utilize the pre-trained multi-context ConveRT model for context representation in a model trained from scratch, and leverage the immediately preceding user utterance for context generation in a model adapted from the pre-trained GPT-2.
arXiv Detail & Related papers (2021-11-28T11:48:02Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)