Related papers: Distribution Aware Metrics for Conditional Natural Language Generation

Distribution Aware Metrics for Conditional Natural Language Generation

URL: http://arxiv.org/abs/2209.07518v1
Date: Thu, 15 Sep 2022 17:58:13 GMT
Title: Distribution Aware Metrics for Conditional Natural Language Generation
Authors: David M Chan, Yiming Ni, Austin Myers, Sudheendra Vijayanarasimhan, David A Ross, John Canny
Abstract summary: We argue that existing metrics are not appropriate for domains such as visual description or summarization where ground truths are semantically diverse. We propose a novel paradigm for multi-candidate evaluation of conditional language generation models.
Score: 3.6350564275444173
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground truth data (i.e. dispersion of the distribution of conditional texts) can be ascribed to noise, such as in automated speech recognition, it does not allow for robust evaluation in the case where diversity in the ground truths represents signal for the model. In this work we argue that existing metrics are not appropriate for domains such as visual description or summarization where ground truths are semantically diverse, and where the diversity in those captions captures useful additional information about the context. We propose a novel paradigm for multi-candidate evaluation of conditional language generation models, and a new family of metrics that compare the distributions of reference and model-generated caption sets using small sample sets of each. We demonstrate the utility of our approach with a case study in visual description: where we show that existing models optimize for single-description quality over diversity, and gain some insights into how sampling methods and temperature impact description quality and diversity.

Related papers

Conditional Vendi Score: An Information-Theoretic Approach to Diversity Evaluation of Prompt-based Generative Models [15.40817940713399]
We introduce the Conditional-Vendi score based on $H(X|T)$ to quantify the internal diversity of the model. We conduct several numerical experiments to show the correlation between the Conditional-Vendi score and the internal diversity of text-conditioned generative models.
arXiv Detail & Related papers (2024-11-05T05:30:39Z)
Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
Explaining Datasets in Words: Statistical Models with Natural Language Parameters [66.69456696878842]
We introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates. We apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other.
arXiv Detail & Related papers (2024-09-13T01:40:20Z)
Are we describing the same sound? An analysis of word embedding spaces of expressive piano performance [4.867952721052875]
We investigate the uncertainty for the domain of characterizations of expressive piano performance. We test five embedding models and their similarity structure for correspondence with the ground truth. The quality of embedding models shows great variability with respect to this task.
arXiv Detail & Related papers (2023-12-31T12:20:03Z)
Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings. Our model operates on parallel data in $N$ languages. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics [14.624063829492764]
We find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We show that state-of-the-art models even outperform held-out ground truth captions on modern metrics.
arXiv Detail & Related papers (2022-05-12T17:55:08Z)
Disentangling Generative Factors in Natural Language with Discrete Variational Autoencoders [0.0]
We argue that continuous variables may not be ideal to model features of textual data, due to the fact that most generative factors in text are discrete. We propose a Variational Autoencoder based method which models language features as discrete variables and encourages independence between variables for learning disentangled representations.
arXiv Detail & Related papers (2021-09-15T09:10:05Z)
Diverse Semantic Image Synthesis via Probability Distribution Modeling [103.88931623488088]
We propose a novel diverse semantic image synthesis framework. Our method can achieve superior diversity and comparable quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-11T18:59:25Z)
Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task. The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them. By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word. We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages. Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.