Benchmarking Long-tail Generalization with Likelihood Splits
- URL: http://arxiv.org/abs/2210.06799v2
- Date: Tue, 2 May 2023 10:05:57 GMT
- Title: Benchmarking Long-tail Generalization with Likelihood Splits
- Authors: Ameya Godbole, Robin Jia
- Abstract summary: We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets.
We create 'Likelihood Splits' where examples that are assigned lower likelihood by a pre-trained language model are placed in the test set, and more likely examples are in the training set.
- Score: 20.47194488430863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In order to reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create 'Likelihood Splits' where examples that are assigned lower likelihood by a pre-trained language model (LM) are placed in the test set, and more likely examples are in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood Splits surface more challenges than random splits: relative error rates of state-of-the-art models increase by 59% for semantic parsing on Spider, 93% for natural language inference on SNLI, and 33% for yes/no question answering on BoolQ, on our splits compared with the corresponding random splits. Moreover, Likelihood Splits create fairer benchmarks than adversarial filtering; when the LM used to create the splits is also employed as the task model, our splits do not unfairly penalize the LM.
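As a minimal sketch of the splitting procedure, the snippet below scores each example with a pretrained LM and routes the least likely examples to the test set. The choice of GPT-2, the length-normalized log-likelihood, and the 20% test fraction are illustrative assumptions, not necessarily the authors' exact setup.

```python
# Minimal sketch of constructing a Likelihood Split. Assumptions (not
# necessarily the authors' exact setup): GPT-2 as the scoring LM,
# length-normalized log-likelihood, and a fixed 20% test fraction.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels == input_ids the model returns the mean cross-entropy.
    loss = model(ids, labels=ids).loss
    return -loss.item()  # higher = more likely under the LM

def likelihood_split(examples, test_fraction=0.2):
    """Route the least likely `test_fraction` of examples to the test set."""
    scored = sorted(examples, key=log_likelihood)  # ascending likelihood
    n_test = int(len(examples) * test_fraction)
    return scored[n_test:], scored[:n_test]  # train, test

train, test = likelihood_split([
    "Is the sky blue?",
    "What is the capital of France?",
    "Enumerate the stadiums whose capacity exceeds the median attendance.",
    "Do cats like milk?",
    "List heads of departments older than 56.",
])
```

Length-normalizing the score keeps the split from merely separating short utterances from long ones; the paper's exact scoring choices (which LM, how likelihood is adapted per task) may differ.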
Related papers
- Paloma: A Benchmark for Evaluating Language Model Fit [114.63031978259467]
Language Model Assessment (Paloma) measures fit to 585 text domains.
We populate our benchmark with results from baselines pretrained on popular corpora.
arXiv Detail & Related papers (2023-12-16T19:12:45Z)
- Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models [65.52639709094963]
Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize.
We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model (a toy sketch follows after this list).
arXiv Detail & Related papers (2022-10-18T22:19:41Z)
- SeqZero: Few-shot Compositional Semantic Parsing with Sequential Prompts and Zero-shot Models [57.29358388475983]
Recent research has shown promising results from combining pretrained language models with canonical utterances.
We propose a novel few-shot semantic parsing method -- SeqZero.
In particular, SeqZero combines the merits of both models via an ensemble equipped with our proposed constrained rescaling.
arXiv Detail & Related papers (2022-05-15T21:13:15Z)
- Learning to Split for Automatic Bias Detection [39.353850990332525]
Learning to Split (ls) is an algorithm for automatic bias detection.
We evaluate our approach on Beer Review, CelebA and MNLI.
arXiv Detail & Related papers (2022-04-28T19:41:08Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- A Conditional Splitting Framework for Efficient Constituency Parsing [14.548146390081778]
We introduce a generic seq2seq parsing framework that casts constituency parsing problems (syntactic and discourse parsing) into a series of conditional splitting decisions.
Our parsing model estimates the conditional probability distribution of possible splitting points in a given text span and supports efficient top-down decoding (a toy sketch follows after this list).
For discourse analysis, we show that in our formulation, discourse segmentation can be framed as a special case of parsing.
arXiv Detail & Related papers (2021-06-30T00:36:34Z)
- Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias in the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
- We Need to Talk About Random Splits [3.236124102160291]
Gorman and Bedrick argued for using random splits rather than standard splits in NLP experiments.
We argue that random splits, like standard splits, lead to overly optimistic performance estimates.
arXiv Detail & Related papers (2020-05-01T22:14:16Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields results comparable to or better than state-of-the-art zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
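To make the Arithmetic Sampling entry above concrete, here is a toy sketch of the idea: each sample is decoded deterministically from a code point in [0, 1), and evenly spaced code points give a diverse, embarrassingly parallel batch. The `next_token_probs` stand-in and all constants are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of arithmetic-code-book sampling: each sequence is decoded
# deterministically from a code point in [0, 1); evenly spaced code points
# yield a diverse, parallelizable "beam". `next_token_probs` is a purely
# illustrative stand-in for a real LM.
import numpy as np

VOCAB = ["<eos>", "the", "cat", "dog", "runs"]

def next_token_probs(prefix):
    """Hypothetical LM stand-in: a deterministic toy distribution."""
    seed = sum(len(t) for t in prefix) + 7 * len(prefix) + 1
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(len(VOCAB)))

def decode_from_code(code, max_len=10):
    """Walk the CDF at each step; rescale the code inside the chosen bin."""
    prefix = []
    for _ in range(max_len):
        p = next_token_probs(prefix)
        cdf = np.cumsum(p)
        idx = int(np.searchsorted(cdf, code * cdf[-1], side="right"))
        # Renormalize the code within the chosen token's interval.
        low = cdf[idx - 1] if idx > 0 else 0.0
        code = (code * cdf[-1] - low) / p[idx]
        if VOCAB[idx] == "<eos>":
            break
        prefix.append(VOCAB[idx])
    return prefix

# K evenly spaced codes give K decodes that can run fully in parallel.
K = 4
samples = [decode_from_code((k + 0.5) / K) for k in range(K)]
print(samples)
```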
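Similarly, for the Conditional Splitting entry above, a toy sketch of greedy top-down decoding: score the candidate split points of a span, take the argmax, and recurse on the two halves. `split_probs` is a hypothetical stand-in for the learned span model, not the authors' architecture.

```python
# Toy sketch of top-down conditional splitting: estimate a distribution
# over split points for each span and recurse greedily. `split_probs` is
# a hypothetical stand-in for the learned model.
import numpy as np

def split_probs(tokens, i, j):
    """Hypothetical model: distribution over split points k in (i, j)."""
    rng = np.random.default_rng(i * 1000 + j)
    return rng.dirichlet(np.ones(j - i - 1))

def parse(tokens, i=0, j=None):
    """Return a binary tree over tokens[i:j] via greedy top-down splits."""
    if j is None:
        j = len(tokens)
    if j - i == 1:
        return tokens[i]  # single-token span is a leaf
    probs = split_probs(tokens, i, j)
    k = i + 1 + int(np.argmax(probs))  # most probable split point
    return (parse(tokens, i, k), parse(tokens, k, j))

print(parse("the cat chased the dog".split()))
```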