Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark
- URL: http://arxiv.org/abs/2010.07676v1
- Date: Thu, 15 Oct 2020 11:50:12 GMT
- Title: Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark
- Authors: Guanhua Zhang, Bing Bai, Jian Liang, Kun Bai, Conghui Zhu, Tiejun Zhao
- Abstract summary: Crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases such as annotation artifacts. We present a new unified cross-dataset benchmark with 14 NLI datasets and re-evaluate 9 widely-used neural network-based NLI models. Our proposed evaluation scheme and experimental baselines can serve as a basis for future reliable NLI research.
- Score: 54.782397511033345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases such as annotation artifacts. Models that exploit these superficial cues gain illusory advantages on the in-domain test set, which leaves the evaluation results over-estimated. The lack of trustworthy evaluation settings and benchmarks stalls the progress of NLI research. In this paper, we propose to assess a model's trustworthy generalization performance with cross-dataset evaluation. We present a new unified cross-dataset benchmark with 14 NLI datasets, and re-evaluate 9 widely-used neural network-based NLI models as well as 5 recently proposed debiasing methods for annotation artifacts. Our proposed evaluation scheme and experimental baselines can serve as a basis for future reliable NLI research.
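The scheme described in the abstract is straightforward to operationalize: train on one dataset's training split and report accuracy on the test splits of all the other datasets, not just the in-domain one. Below is a minimal sketch of such a harness; the dataset names, the `load_nli_dataset` loader, and the toy data are hypothetical stand-ins (not from the paper), and a TF-IDF + logistic-regression classifier substitutes for the neural NLI models the paper actually benchmarks.

```python
# Minimal cross-dataset NLI evaluation harness (illustrative sketch only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

DATASETS = ["SNLI", "MNLI", "SICK", "SciTail"]  # placeholder subset of the 14

def load_nli_dataset(name, split):
    # Toy stand-in so the sketch executes end-to-end; swap in real readers
    # that return ("premise [SEP] hypothesis", label) pairs per dataset.
    toy = [("a man is sleeping [SEP] a person rests", "entailment"),
           ("a man is sleeping [SEP] the man runs a race", "contradiction"),
           ("a man is sleeping [SEP] the man is old", "neutral")]
    texts, labels = zip(*toy)
    return list(texts), list(labels)

def cross_dataset_eval(train_name):
    """Train on one dataset, then test on every dataset in the benchmark."""
    train_texts, train_labels = load_nli_dataset(train_name, "train")
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    scores = {}
    for name in DATASETS:
        test_texts, test_labels = load_nli_dataset(name, "test")
        scores[name] = accuracy_score(test_labels, model.predict(test_texts))
    # In-domain accuracy alone can be inflated by annotation artifacts;
    # the off-domain scores are the trustworthy generalization signal.
    return scores

if __name__ == "__main__":
    print(cross_dataset_eval("SNLI"))
```

Under this protocol, a model that merely memorizes hypothesis-only artifacts of the training dataset should see its off-domain scores drop, which is exactly the over-estimation the paper aims to expose.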
Related papers
- Debias Can be Unreliable: Mitigating Bias Issue in Evaluating Debiasing Recommendation [34.19561411584444]
The traditional evaluation scheme is not suitable for randomly-exposed datasets.
We propose the Unbiased Recall Evaluation scheme, which adjusts how randomly-exposed datasets are used in order to estimate the true Recall performance without bias.
arXiv Detail & Related papers (2024-09-07T12:42:58Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks that measure the capability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of off-policy evaluation (OPE) with human preferences and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- Unsupervised evaluation of GAN sample quality: Introducing the TTJac Score [5.1359892878090845]
"TTJac score" is proposed to measure the fidelity of individual synthesized images in a data-free manner.
The experimental results of applying the proposed metric to StyleGAN 2 and StyleGAN 2 ADA models on FFHQ, AFHQ-Wild, LSUN-Cars, and LSUN-Horse datasets are presented.
arXiv Detail & Related papers (2023-08-31T19:55:50Z)
- Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking [66.83273589348758]
Link prediction attempts to predict whether an unseen edge exists based on only a portion of a graph's edges.
Recent years have seen a flurry of methods that attempt to make use of graph neural networks (GNNs) for this task.
New and diverse datasets have also been created to better evaluate the effectiveness of these new models.
arXiv Detail & Related papers (2023-06-18T01:58:59Z)
- DATa: Domain Adaptation-Aided Deep Table Detection Using Visual-Lexical Representations [2.542864854772221]
We present a novel Domain Adaptation-aided deep Table detection method called DATa.
It guarantees satisfactory performance in a specific target domain where few trusted labels are available.
Experiments show that DATa substantially outperforms competing methods that only utilize visual representations in the target domain.
arXiv Detail & Related papers (2022-11-12T12:14:16Z)
- Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters [35.103851212995046]
Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs.
We explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on.
We develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset; a minimal sketch of this aggregation idea follows this list.
arXiv Detail & Related papers (2022-04-15T12:56:39Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", which discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
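As a companion to the "Stretching Sentence-pair NLI Models" entry above, here is a minimal sketch of one simple way to lift a sentence-pair NLI model to full documents: score the hypothesis against each premise sentence and pool the per-sentence predictions. The `nli_model` callable, the naive sentence splitter, and max-pooling are illustrative assumptions, not necessarily the exact aggregation methods developed in that paper.

```python
# Document-level NLI by aggregating sentence-pair scores (illustrative sketch).
import re
from typing import Callable, Dict, List

def split_sentences(document: str) -> List[str]:
    # Naive split on sentence-final punctuation; use a real sentence
    # tokenizer in practice.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def document_nli(document: str,
                 hypothesis: str,
                 nli_model: Callable[[str, str], Dict[str, float]]) -> Dict[str, float]:
    """Score a hypothesis against a whole document by pooling sentence-pair
    predictions; `nli_model(premise, hypothesis)` is any callable returning
    class probabilities, e.g. a wrapper around an off-the-shelf NLI classifier."""
    per_sentence = [nli_model(s, hypothesis) for s in split_sentences(document)]
    # Under max-pooling, one strongly entailing (or contradicting) sentence
    # is enough to drive the document-level score for that label.
    return {
        "entailment": max(p["entailment"] for p in per_sentence),
        "contradiction": max(p["contradiction"] for p in per_sentence),
    }
```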
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.