Benchmarking Long-tail Generalization with Likelihood Splits
- URL: http://arxiv.org/abs/2210.06799v2
- Date: Tue, 2 May 2023 10:05:57 GMT
- Title: Benchmarking Long-tail Generalization with Likelihood Splits
- Authors: Ameya Godbole, Robin Jia
- Abstract summary: We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets.
We create 'Likelihood Splits' where examples that are assigned lower likelihood by a pre-trained language model are placed in the test set, and more likely examples are in the training set.
- Score: 20.47194488430863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In order to reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create 'Likelihood Splits' where examples that are assigned lower likelihood by a pre-trained language model (LM) are placed in the test set, and more likely examples are in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood Splits surface more challenges than random splits: relative error rates of state-of-the-art models increase by 59% for semantic parsing on Spider, 93% for natural language inference on SNLI, and 33% for yes/no question answering on BoolQ, on our splits compared with the corresponding random splits. Moreover, Likelihood Splits create fairer benchmarks than adversarial filtering; when the LM used to create the splits is also employed as the task model, our splits do not unfairly penalize the LM.
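As a minimal sketch of the splitting procedure, the snippet below scores each example with a pretrained LM and routes the least likely examples to the test set. The choice of GPT-2, the length-normalized log-likelihood, and the 20% test fraction are illustrative assumptions, not necessarily the authors' exact setup.

```python
# Minimal sketch of constructing a Likelihood Split. Assumptions (not
# necessarily the authors' exact setup): GPT-2 as the scoring LM,
# length-normalized log-likelihood, and a fixed 20% test fraction.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels == input_ids the model returns the mean cross-entropy.
    loss = model(ids, labels=ids).loss
    return -loss.item()  # higher = more likely under the LM

def likelihood_split(examples, test_fraction=0.2):
    """Route the least likely `test_fraction` of examples to the test set."""
    scored = sorted(examples, key=log_likelihood)  # ascending likelihood
    n_test = int(len(examples) * test_fraction)
    return scored[n_test:], scored[:n_test]  # train, test

train, test = likelihood_split([
    "Is the sky blue?",
    "What is the capital of France?",
    "Enumerate the stadiums whose capacity exceeds the median attendance.",
    "Do cats like milk?",
    "List heads of departments older than 56.",
])
```

Length-normalizing the score keeps the split from merely separating short utterances from long ones; the paper's exact scoring choices (which LM, how likelihood is adapted per task) may differ.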
Related papers
- Paloma: A Benchmark for Evaluating Language Model Fit [114.63031978259467]
Language Model Assessment (Paloma) measures fit to 585 text domains.
We populate our benchmark with results from baselines pretrained on popular corpora.
arXiv Detail & Related papers (2023-12-16T19:12:45Z)
- Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models [65.52639709094963]
Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize.
We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model (a toy sketch follows after this list).
arXiv Detail & Related papers (2022-10-18T22:19:41Z)
- SeqZero: Few-shot Compositional Semantic Parsing with Sequential Prompts and Zero-shot Models [57.29358388475983]
Recent research has shown promising results from combining pretrained language models with canonical utterances.
We propose a novel few-shot semantic parsing method -- SeqZero.
In particular, SeqZero combines the merits of both models via an ensemble equipped with our proposed constrained rescaling.
arXiv Detail & Related papers (2022-05-15T21:13:15Z)
- Learning to Split for Automatic Bias Detection [39.353850990332525]
Learning to Split (ls) is an algorithm for automatic bias detection.
We evaluate our approach on Beer Review, CelebA and MNLI.
arXiv Detail & Related papers (2022-04-28T19:41:08Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- A Conditional Splitting Framework for Efficient Constituency Parsing [14.548146390081778]
We introduce a generic seq2seq parsing framework that casts constituency parsing problems (syntactic and discourse parsing) into a series of conditional splitting decisions.
Our parsing model estimates the conditional probability distribution of possible splitting points in a given text span and supports efficient top-down decoding (a toy sketch follows after this list).
For discourse analysis, we show that in our formulation, discourse segmentation can be framed as a special case of parsing.
arXiv Detail & Related papers (2021-06-30T00:36:34Z)
- Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias in the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
- We Need to Talk About Random Splits [3.236124102160291]
Gorman and Bedrick argued for using random splits rather than standard splits in NLP experiments.
We argue that random splits, like standard splits, lead to overly optimistic performance estimates.
arXiv Detail & Related papers (2020-05-01T22:14:16Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields results comparable to or better than state-of-the-art zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
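To make the Arithmetic Sampling entry above concrete, here is a toy sketch of the idea: each sample is decoded deterministically from a code point in [0, 1), and evenly spaced code points give a diverse, embarrassingly parallel batch. The `next_token_probs` stand-in and all constants are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of arithmetic-code-book sampling: each sequence is decoded
# deterministically from a code point in [0, 1); evenly spaced code points
# yield a diverse, parallelizable "beam". `next_token_probs` is a purely
# illustrative stand-in for a real LM.
import numpy as np

VOCAB = ["<eos>", "the", "cat", "dog", "runs"]

def next_token_probs(prefix):
    """Hypothetical LM stand-in: a deterministic toy distribution."""
    seed = sum(len(t) for t in prefix) + 7 * len(prefix) + 1
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(len(VOCAB)))

def decode_from_code(code, max_len=10):
    """Walk the CDF at each step; rescale the code inside the chosen bin."""
    prefix = []
    for _ in range(max_len):
        p = next_token_probs(prefix)
        cdf = np.cumsum(p)
        idx = int(np.searchsorted(cdf, code * cdf[-1], side="right"))
        # Renormalize the code within the chosen token's interval.
        low = cdf[idx - 1] if idx > 0 else 0.0
        code = (code * cdf[-1] - low) / p[idx]
        if VOCAB[idx] == "<eos>":
            break
        prefix.append(VOCAB[idx])
    return prefix

# K evenly spaced codes give K decodes that can run fully in parallel.
K = 4
samples = [decode_from_code((k + 0.5) / K) for k in range(K)]
print(samples)
```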
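Similarly, for the Conditional Splitting entry above, a toy sketch of greedy top-down decoding: score the candidate split points of a span, take the argmax, and recurse on the two halves. `split_probs` is a hypothetical stand-in for the learned span model, not the authors' architecture.

```python
# Toy sketch of top-down conditional splitting: estimate a distribution
# over split points for each span and recurse greedily. `split_probs` is
# a hypothetical stand-in for the learned model.
import numpy as np

def split_probs(tokens, i, j):
    """Hypothetical model: distribution over split points k in (i, j)."""
    rng = np.random.default_rng(i * 1000 + j)
    return rng.dirichlet(np.ones(j - i - 1))

def parse(tokens, i=0, j=None):
    """Return a binary tree over tokens[i:j] via greedy top-down splits."""
    if j is None:
        j = len(tokens)
    if j - i == 1:
        return tokens[i]  # single-token span is a leaf
    probs = split_probs(tokens, i, j)
    k = i + 1 + int(np.argmax(probs))  # most probable split point
    return (parse(tokens, i, k), parse(tokens, k, j))

print(parse("the cat chased the dog".split()))
```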