Evaluating Robustness of Dialogue Summarization Models in the Presence of Naturally Occurring Variations
- URL: http://arxiv.org/abs/2311.08705v1
- Date: Wed, 15 Nov 2023 05:11:43 GMT
- Title: Evaluating Robustness of Dialogue Summarization Models in the Presence of Naturally Occurring Variations
- Authors: Ankita Gupta, Chulaka Gunasekara, Hui Wan, Jatin Ganhotra, Sachindra Joshi, Marina Danilevsky
- Abstract summary: We systematically investigate the impact of real-life variations on state-of-the-art dialogue summarization models.
We introduce two types of perturbations: utterance-level perturbations that modify individual utterances with errors and language variations, and dialogue-level perturbations that add non-informative exchanges.
We find that both fine-tuned and instruction-tuned models are affected by input variations, with the latter being more susceptible.
- Score: 13.749495524988774
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The dialogue summarization task involves summarizing long conversations while
preserving the most salient information. Real-life dialogues often involve
naturally occurring variations (e.g., repetitions, hesitations) and existing
dialogue summarization models suffer performance drops on such
conversations. In this study, we systematically investigate the impact of such
variations on state-of-the-art dialogue summarization models using publicly
available datasets. To simulate real-life variations, we introduce two types of
perturbations: utterance-level perturbations that modify individual utterances
with errors and language variations, and dialogue-level perturbations that add
non-informative exchanges (e.g., repetitions, greetings). We conduct our
analysis along three dimensions of robustness: consistency, saliency, and
faithfulness, which capture different aspects of the summarization model's
performance. We find that both fine-tuned and instruction-tuned models are
affected by input variations, with the latter being more susceptible,
particularly to dialogue-level perturbations. We also validate our findings via
human evaluation. Finally, we investigate whether the robustness of fine-tuned
models can be improved by training them on a fraction of perturbed data. We
observe that this approach is insufficient to address the robustness
challenges of current models, warranting more thorough investigation into
better solutions. Overall, our work highlights robustness challenges in
dialogue summarization and provides insights for future research.
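As a concrete illustration of the perturbation setup described in the abstract, the minimal sketch below applies an utterance-level perturbation (hesitation fillers), a dialogue-level perturbation (a non-informative greeting exchange plus a repeated turn), and a crude consistency check between summaries of the original and perturbed dialogues. The word lists, function names, and edit operations are illustrative assumptions, not the authors' released implementation.

```python
import random

# Illustrative sketch only: the exact edit operations, word lists, and
# function names below are assumptions, not the paper's released code.

FILLERS = ["um", "uh", "you know"]  # hesitation markers
GREETING_EXCHANGE = [
    "A: Hi, how are you doing?",    # non-informative opening exchange
    "B: I'm good, thanks! You?",
]

def perturb_utterance(utterance: str, rng: random.Random) -> str:
    """Utterance-level perturbation: insert a hesitation filler at a
    random position to simulate disfluent, real-life speech."""
    words = utterance.split()
    words.insert(rng.randrange(len(words) + 1), rng.choice(FILLERS))
    return " ".join(words)

def perturb_dialogue(turns: list[str], rng: random.Random) -> list[str]:
    """Dialogue-level perturbation: prepend a non-informative greeting
    exchange and duplicate one turn to mimic a repetition."""
    repeat_idx = rng.randrange(len(turns))
    perturbed = GREETING_EXCHANGE + list(turns)
    # Re-insert the chosen turn right after its original position.
    perturbed.insert(len(GREETING_EXCHANGE) + repeat_idx + 1, turns[repeat_idx])
    return perturbed

def consistency(summary_original: str, summary_perturbed: str) -> float:
    """Crude stand-in for the consistency dimension: set-based token F1
    between summaries of the original and perturbed dialogues (the paper
    relies on proper summarization metrics instead)."""
    a = set(summary_original.lower().split())
    b = set(summary_perturbed.lower().split())
    common = len(a & b)
    if not a or not b or common == 0:
        return 0.0
    p, r = common / len(b), common / len(a)
    return 2 * p * r / (p + r)

if __name__ == "__main__":
    rng = random.Random(0)
    dialogue = [
        "A: The meeting moved to 3pm tomorrow.",
        "B: Noted, I'll update the calendar invite.",
    ]
    print([perturb_utterance(t, rng) for t in dialogue])
    print(perturb_dialogue(dialogue, rng))
```

In practice, the summaries fed to the consistency check would come from the fine-tuned or instruction-tuned summarizers under evaluation, and overlap-based summarization metrics such as ROUGE would replace the token-F1 stand-in.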
Related papers
- SPECTRUM: Speaker-Enhanced Pre-Training for Long Dialogue Summarization [48.284512017469524]
Multi-turn dialogues are characterized by their extended length and turn-taking structure.
Traditional language models often overlook the distinct features of these dialogues by treating them as regular text.
We propose a speaker-enhanced pre-training method for long dialogue summarization.
arXiv Detail & Related papers (2024-01-31T04:50:00Z)
- Pre-training Multi-party Dialogue Models with Latent Discourse Inference [85.9683181507206]
We pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying.
To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model.
arXiv Detail & Related papers (2023-05-24T14:06:27Z)
- Analyzing and Evaluating Faithfulness in Dialogue Summarization [67.07947198421421]
We first perform a fine-grained human analysis of the faithfulness of dialogue summaries and observe that over 35% of generated summaries are factually inconsistent with respect to the source dialogues.
We present a new model-level faithfulness evaluation method. It examines generation models with multi-choice questions created by rule-based transformations.
arXiv Detail & Related papers (2022-10-21T07:22:43Z)
- A Focused Study on Sequence Length for Dialogue Summarization [68.73335643440957]
First, we analyze the length differences between existing models' outputs and the corresponding human references.
Second, we identify salient features for summary length prediction by comparing different model settings.
Third, we experiment with a length-aware summarizer and show notable improvement on existing models when summary length is well incorporated.
arXiv Detail & Related papers (2022-09-24T02:49:48Z)
- Learning Locality and Isotropy in Dialogue Modeling [28.743212772593335]
We propose a simple method for dialogue representation calibration, namely SimDRC, to build isotropic and conversational feature spaces.
Experimental results show that our approach significantly outperforms the current state-of-the-art models on three dialogue tasks.
arXiv Detail & Related papers (2022-05-29T06:48:53Z)
- Coreference-Aware Dialogue Summarization [24.986030179701405]
We investigate approaches to explicitly incorporate coreference information in neural abstractive dialogue summarization models.
Experimental results show that the proposed approaches achieve state-of-the-art performance.
Evaluation results on factual correctness suggest such coreference-aware models are better at tracing the information flow among interlocutors.
arXiv Detail & Related papers (2021-06-16T05:18:50Z)
- Robustness Testing of Language Understanding in Dialog Systems [33.30143655553583]
We conduct comprehensive evaluation and analysis with respect to the robustness of natural language understanding models.
We introduce three important aspects related to language understanding in real-world dialog systems, namely, language variety, speech characteristics, and noise perturbation.
We propose LAUG, a model-agnostic toolkit that approximates natural perturbations for testing the robustness of dialog systems.
arXiv Detail & Related papers (2020-12-30T18:18:47Z)
- I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling [104.09033240889106]
We introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues.
We then compare a structured utterance-based approach that uses pre-trained Transformer models for contradiction detection with the typical unstructured approach.
arXiv Detail & Related papers (2020-12-24T18:47:49Z)
- Ranking Enhanced Dialogue Generation [77.8321855074999]
How to effectively utilize the dialogue history is a crucial problem in multi-turn dialogue generation.
Previous works usually employ various neural network architectures to model the history.
This paper proposes a Ranking Enhanced Dialogue generation framework.
arXiv Detail & Related papers (2020-08-13T01:49:56Z)