Distribution Aware Metrics for Conditional Natural Language Generation
- URL: http://arxiv.org/abs/2209.07518v1
- Date: Thu, 15 Sep 2022 17:58:13 GMT
- Title: Distribution Aware Metrics for Conditional Natural Language Generation
- Authors: David M Chan, Yiming Ni, Austin Myers, Sudheendra Vijayanarasimhan,
David A Ross, John Canny
- Abstract summary: We argue that existing metrics are not appropriate for domains such as visual description or summarization where ground truths are semantically diverse.
We propose a novel paradigm for multi-candidate evaluation of conditional language generation models.
- Score: 3.6350564275444173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional automated metrics for evaluating conditional natural language
generation use pairwise comparisons between a single generated text and the
best-matching gold-standard ground truth text. When multiple ground truths are
available, scores are aggregated using an average or max operation across
references. While this approach works well when diversity in the ground truth
data (i.e. dispersion of the distribution of conditional texts) can be ascribed
to noise, such as in automated speech recognition, it does not allow for robust
evaluation in the case where diversity in the ground truths represents signal
for the model. In this work we argue that existing metrics are not appropriate
for domains such as visual description or summarization where ground truths are
semantically diverse, and where the diversity in those captions captures useful
additional information about the context. We propose a novel paradigm for
multi-candidate evaluation of conditional language generation models, and a new
family of metrics that compare the distributions of reference and
model-generated caption sets using small sample sets of each. We demonstrate
the utility of our approach with a case study in visual description, where we
show that existing models optimize for single-description quality over
diversity, and gain some insights into how sampling methods and temperature
impact description quality and diversity.
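
A minimal, hypothetical sketch of this paradigm (not the paper's exact metric family): embed each caption in the reference set and in a small set of model samples, then compare the two embedding clouds as distributions, here with a kernel maximum mean discrepancy (MMD). The embedding source and kernel bandwidth are illustrative assumptions.

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Squared MMD with an RBF kernel between two small sample sets.

    X: (n, d) embeddings of reference captions; Y: (m, d) embeddings
    of model-generated captions. The value is near zero when the two
    caption distributions match in both content and dispersion.
    """
    def k(A, B):
        # Pairwise squared Euclidean distances -> RBF kernel values.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq)

    n, m = len(X), len(Y)
    kxx = (k(X, X).sum() - n) / (n * (n - 1))  # unbiased: drop self-pairs
    kyy = (k(Y, Y).sum() - m) / (m * (m - 1))
    kxy = k(X, Y).mean()
    return float(kxx + kyy - 2.0 * kxy)
```

Unlike max-over-references aggregation, a set-level statistic of this kind penalizes a model that collapses onto a single high-scoring caption when the references themselves are diverse.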
Related papers
- Conditional Vendi Score: An Information-Theoretic Approach to Diversity Evaluation of Prompt-based Generative Models [15.40817940713399]
We introduce the Conditional-Vendi score based on $H(X|T)$ to quantify the internal diversity of the model.
We conduct several numerical experiments to show the correlation between the Conditional-Vendi score and the internal diversity of text-conditioned generative models.
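
As a rough sketch of how such a score can be computed (assuming the standard Vendi construction, with $H(X|T)$ approximated as $H(X,T) - H(T)$ over matrix-based entropies; the elementwise-product joint kernel below is an assumption, not necessarily the paper's exact formulation):

```python
import numpy as np

def vendi_entropy(K):
    """Entropy of the eigenvalue spectrum of a normalized kernel matrix.

    For an n x n PSD similarity matrix K with unit diagonal, the
    eigenvalues of K / n sum to 1 and act as mode probabilities.
    """
    lam = np.linalg.eigvalsh(K / K.shape[0])
    lam = lam[lam > 1e-12]  # drop numerically zero modes
    return float(-(lam * np.log(lam)).sum())

def conditional_vendi(sample_emb, prompt_emb):
    """Hypothetical Conditional-Vendi sketch: exp(H(X,T) - H(T))."""
    def cos_kernel(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        return E @ E.T
    K_x, K_t = cos_kernel(sample_emb), cos_kernel(prompt_emb)
    h_joint = vendi_entropy(K_x * K_t)  # H(X, T) via elementwise-product kernel
    h_prompt = vendi_entropy(K_t)       # H(T)
    return float(np.exp(h_joint - h_prompt))
```

Roughly, a value near 1 suggests the model emits essentially one output per prompt, while larger values indicate more internal diversity.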
arXiv Detail & Related papers (2024-11-05T05:30:39Z)
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- Explaining Datasets in Words: Statistical Models with Natural Language Parameters [66.69456696878842]
We introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates.
We apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other.
arXiv Detail & Related papers (2024-09-13T01:40:20Z)
- Are we describing the same sound? An analysis of word embedding spaces of expressive piano performance [4.867952721052875]
We investigate uncertainty in the domain of characterizations of expressive piano performance.
We test five embedding models and their similarity structure for correspondence with the ground truth.
The quality of embedding models shows great variability with respect to this task.
arXiv Detail & Related papers (2023-12-31T12:20:03Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics [14.624063829492764]
We find that caption diversity is a major driving factor behind the generation of generic and uninformative captions.
We show that state-of-the-art models even outperform held-out ground truth captions on modern metrics.
arXiv Detail & Related papers (2022-05-12T17:55:08Z)
- Disentangling Generative Factors in Natural Language with Discrete Variational Autoencoders [0.0]
We argue that continuous variables may not be ideal for modeling features of textual data, because most generative factors in text are discrete.
We propose a Variational Autoencoder based method which models language features as discrete variables and encourages independence between variables for learning disentangled representations.
arXiv Detail & Related papers (2021-09-15T09:10:05Z)
- Diverse Semantic Image Synthesis via Probability Distribution Modeling [103.88931623488088]
We propose a novel diverse semantic image synthesis framework.
Our method can achieve superior diversity and comparable quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-11T18:59:25Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.