Measuring and Improving Semantic Diversity of Dialogue Generation
- URL: http://arxiv.org/abs/2210.05725v1
- Date: Tue, 11 Oct 2022 18:36:54 GMT
- Title: Measuring and Improving Semantic Diversity of Dialogue Generation
- Authors: Seungju Han, Beomsu Kim, Buru Chang
- Abstract summary: We introduce a new automatic evaluation metric to measure the semantic diversity of generated responses.
We show that our proposed metric captures human judgments on response diversity better than existing lexical-level diversity metrics.
We also propose a simple yet effective learning method that improves the semantic diversity of generated responses.
- Score: 21.59385143783728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Response diversity has become an important criterion for evaluating the
quality of open-domain dialogue generation models. However, current evaluation
metrics for response diversity often fail to capture the semantic diversity of
generated responses, as they mainly consider lexical aspects of the generated
responses. In this paper, we introduce a new automatic evaluation metric to
measure the semantic diversity of generated responses. Through human
evaluation, we demonstrate that our proposed metric captures human judgments on
response diversity better than existing lexical-level diversity metrics.
Furthermore, motivated by analyzing an existing dialogue dataset, we propose a
simple yet effective learning method that improves the semantic diversity of
generated responses. Our learning method weights training samples based on the
semantic distribution of the training set. We show that our learning method
improves response diversity and coherency better than other baseline methods
through automatic and human evaluation.
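
The abstract does not spell out how the semantic distribution is estimated or how the weights are computed, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes each response has already been assigned a semantic cluster label (for example, by clustering sentence embeddings), measures diversity as the Shannon entropy of the cluster distribution, and weights training samples by inverse cluster frequency. The function names `semantic_entropy` and `inverse_frequency_weights` are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical cluster distribution.

    Assumption: cluster_ids holds one precomputed semantic cluster label
    per response. Higher entropy means responses are spread more evenly
    across semantic clusters.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def inverse_frequency_weights(cluster_ids):
    """One weight per sample, inversely proportional to cluster frequency.

    Assumption: this balanced weighting, total / (num_clusters * count),
    is an illustrative choice. It keeps the mean weight at 1 while
    up-weighting samples from rare semantic clusters.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    num_clusters = len(counts)
    return [total / (num_clusters * counts[c]) for c in cluster_ids]


if __name__ == "__main__":
    # Three responses fall in a common cluster (0), one in a rare cluster (1).
    labels = [0, 0, 0, 1]
    print(semantic_entropy(labels))
    print(inverse_frequency_weights(labels))
```

In a training loop, such weights would typically multiply the per-sample loss so that responses from over-represented semantic clusters contribute less to the gradient.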
 
      
Related papers
- Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning [63.531262595858]
 A divide-and-conquer approach breaks the comprehensive evaluation task into localized scoring tasks, followed by a final global assessment.
We introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations.
Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation.
 arXiv  Detail & Related papers  (2025-05-26T16:39:41Z)
- Evaluating the Diversity and Quality of LLM Generated Content [72.84945252821908]
 We introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds.
Although preference-tuned models exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models.
These findings have important implications for applications that require diverse yet high-quality outputs.
 arXiv  Detail & Related papers  (2025-04-16T23:02:23Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
 We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
 arXiv  Detail & Related papers  (2024-12-20T03:26:47Z)
- PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
 We propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development.
We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality.
 arXiv  Detail & Related papers  (2024-07-17T08:13:22Z)
- Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting [19.79214899011072]
 This paper formalizes diversity of representation in generative large language models.
We present evaluation datasets and propose metrics to measure diversity in generated responses along people and culture axes.
We find that LLMs understand the notion of diversity, and that they can reason and critique their own responses for that goal.
 arXiv  Detail & Related papers  (2023-10-25T10:17:17Z)
- Diverse and Faithful Knowledge-Grounded Dialogue Generation via Sequential Posterior Inference [82.28542500317445]
 We present an end-to-end learning framework, termed Sequential Posterior Inference (SPI), capable of selecting knowledge and generating dialogues.
Unlike other methods, SPI does not require the inference network or assume a simple geometry of the posterior distribution.
 arXiv  Detail & Related papers  (2023-06-01T21:23:13Z)
- Generate, Evaluate, and Select: A Dialogue System with a Response Evaluator for Diversity-Aware Response Generation [9.247397520986999]
 We aim to overcome the lack of diversity in responses of current dialogue systems.
We propose a generator-evaluator model that evaluates multiple responses generated by a response generator.
We conduct human evaluations to compare the output of the proposed system with that of a baseline system.
 arXiv  Detail & Related papers  (2022-06-10T08:22:22Z)
- Semantic Diversity in Dialogue with Natural Language Inference [19.74618235525502]
 This paper makes two substantial contributions to improving diversity in dialogue generation.
First, we propose a novel metric which uses Natural Language Inference (NLI) to measure the semantic diversity of a set of model responses for a conversation.
Second, we demonstrate how to iteratively improve the semantic diversity of a sampled set of responses via a new generation procedure called Diversity Threshold Generation.
 arXiv  Detail & Related papers  (2022-05-03T13:56:32Z)
- Evaluation of Self-taught Learning-based Representations for Facial Emotion Recognition [62.30451764345482]
 This work describes different strategies to generate unsupervised representations obtained through the concept of self-taught learning for facial emotion recognition.
The idea is to create complementary representations promoting diversity by varying the autoencoders' initialization, architecture, and training data.
Experimental results on Jaffe and Cohn-Kanade datasets using a leave-one-subject-out protocol show that FER methods based on the proposed diverse representations compare favorably against state-of-the-art approaches.
 arXiv  Detail & Related papers  (2022-04-26T22:48:15Z)
- Towards Robust Online Dialogue Response Generation [62.99904593650087]
 We argue that this can be caused by a discrepancy between training and real-world testing.
We propose a hierarchical sampling-based method consisting of both utterance-level sampling and semi-utterance-level sampling.
 arXiv  Detail & Related papers  (2022-03-07T06:51:41Z)
- Unsupervised Domain Adaptive Person Re-Identification via Human Learning Imitation [67.52229938775294]
 In recent years, researchers have proposed using the teacher-student framework to reduce the domain gap between different person re-identification datasets.
Inspired by recent methods based on the teacher-student framework, we explore further ways to imitate the human learning process from different aspects.
 arXiv  Detail & Related papers  (2021-11-28T01:14:29Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
 We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
 arXiv  Detail & Related papers  (2021-02-20T03:29:20Z)
- Evaluating for Diversity in Question Generation over Text [5.369031521471668]
 We argue that commonly-used evaluation metrics such as BLEU and METEOR are not suitable for this task due to the inherent diversity of reference questions.
We propose a variational encoder-decoder model for this task.
 arXiv  Detail & Related papers  (2020-08-17T13:16:12Z)
- Evaluating the Evaluation of Diversity in Natural Language Generation [43.05127848086264]
 We propose a framework for evaluating diversity metrics in natural language generation systems.
Our framework can advance the understanding of different diversity metrics, an essential step on the road towards better NLG systems.
 arXiv  Detail & Related papers  (2020-04-06T20:44:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     