Distribution Aware Metrics for Conditional Natural Language Generation
- URL: http://arxiv.org/abs/2209.07518v1
- Date: Thu, 15 Sep 2022 17:58:13 GMT
- Title: Distribution Aware Metrics for Conditional Natural Language Generation
- Authors: David M Chan, Yiming Ni, Austin Myers, Sudheendra Vijayanarasimhan,
David A Ross, John Canny
- Abstract summary: We argue that existing metrics are not appropriate for domains such as visual description or summarization where ground truths are semantically diverse.
We propose a novel paradigm for multi-candidate evaluation of conditional language generation models.
- Score: 3.6350564275444173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional automated metrics for evaluating conditional natural language
generation use pairwise comparisons between a single generated text and the
best-matching gold-standard ground truth text. When multiple ground truths are
available, scores are aggregated using an average or max operation across
references. While this approach works well when diversity in the ground truth
data (i.e. dispersion of the distribution of conditional texts) can be ascribed
to noise, such as in automated speech recognition, it does not allow for robust
evaluation in the case where diversity in the ground truths represents signal
for the model. In this work we argue that existing metrics are not appropriate
for domains such as visual description or summarization where ground truths are
semantically diverse, and where the diversity in those captions captures useful
additional information about the context. We propose a novel paradigm for
multi-candidate evaluation of conditional language generation models, and a new
family of metrics that compare the distributions of reference and
model-generated caption sets using small sample sets of each. We demonstrate
the utility of our approach with a case study in visual description: where we
show that existing models optimize for single-description quality over
diversity, and gain some insights into how sampling methods and temperature
impact description quality and diversity.
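The contrast drawn in the abstract, per-candidate scores aggregated over references versus a comparison of whole sample sets, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: `jaccard` is a toy word-overlap similarity standing in for a real pairwise metric, and `mmd2` is a generic kernel two-sample statistic (a biased squared maximum mean discrepancy estimate) standing in for a set-level comparison. It is not the authors' proposed metric family.

```python
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Toy pairwise similarity (assumption): Jaccard overlap of word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pairwise_score(candidate, references, reduce=max):
    """Score one candidate against all references, then aggregate (max or mean)."""
    return reduce([jaccard(candidate, r) for r in references])


def mmd2(candidates, references, kernel=jaccard):
    """Biased squared-MMD estimate between two small sample sets.

    Hypothetical stand-in for a distribution-aware comparison: it is
    zero when the two sets coincide and grows as they diverge.
    """
    def avg(xs, ys):
        return mean([kernel(x, y) for x in xs for y in ys])

    return (avg(candidates, candidates)
            + avg(references, references)
            - 2 * avg(candidates, references))


references = [
    "a dog runs on the beach",
    "a brown dog near the ocean",
    "an animal plays in the sand",
]
# A "mode-collapsed" model: the same high-quality caption sampled three times.
collapsed = ["a dog runs on the beach"] * 3

best_match = pairwise_score(collapsed[0], references, max)    # perfect match with one reference
average_match = pairwise_score(collapsed[0], references, mean)
set_gap_collapsed = mmd2(collapsed, references)               # penalizes missing diversity
set_gap_identical = mmd2(references, references)              # sets coincide
```

In this toy setup the repeated caption gets the best possible max-aggregated score, while the set-level statistic separates the collapsed sample set from the reference set. That is the kind of behavior the abstract attributes to single-description metrics versus distribution-level ones.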
Related papers
- Are we describing the same sound? An analysis of word embedding spaces
of expressive piano performance [4.867952721052875]
We investigate the uncertainty for the domain of characterizations of expressive piano performance.
We test five embedding models and their similarity structure for correspondence with the ground truth.
The quality of embedding models shows great variability with respect to this task.
arXiv Detail & Related papers (2023-12-31T12:20:03Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z)
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
- What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics [14.624063829492764]
We find that caption diversity is a major driving factor behind the generation of generic and uninformative captions.
We show that state-of-the-art models even outperform held-out ground truth captions on modern metrics.
arXiv Detail & Related papers (2022-05-12T17:55:08Z)
- Disentangling Generative Factors in Natural Language with Discrete Variational Autoencoders [0.0]
We argue that continuous variables may not be ideal to model features of textual data, due to the fact that most generative factors in text are discrete.
We propose a Variational Autoencoder based method which models language features as discrete variables and encourages independence between variables for learning disentangled representations.
arXiv Detail & Related papers (2021-09-15T09:10:05Z)
- Generating Diverse Descriptions from Semantic Graphs [38.28044884015192]
We present a graph-to-text model, incorporating a latent variable in an encoder-decoder model, and its use in an ensemble.
We evaluate the models on WebNLG datasets in English and Russian, and show an ensemble of models produces diverse sets of generated sentences, while retaining similar quality to state-of-the-art models.
arXiv Detail & Related papers (2021-08-12T11:00:09Z)
- Diverse Semantic Image Synthesis via Probability Distribution Modeling [103.88931623488088]
We propose a novel diverse semantic image synthesis framework.
Our method can achieve superior diversity and comparable quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-11T18:59:25Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.