COM2SENSE: A Commonsense Reasoning Benchmark with Complementary
Sentences
- URL: http://arxiv.org/abs/2106.00969v1
- Date: Wed, 2 Jun 2021 06:31:55 GMT
- Authors: Shikhar Singh, Nuan Wen, Yu Hou, Pegah Alipoormolabashi, Te-Lin Wu,
Xuezhe Ma, Nanyun Peng
- Abstract summary: Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI).
Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.
We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
- Score: 21.11065466376105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commonsense reasoning is intuitive for humans but has been a long-term
challenge for artificial intelligence (AI). Recent advancements in pretrained
language models have shown promising results on several commonsense benchmark
datasets. However, the reliability and comprehensiveness of these benchmarks
for assessing models' commonsense reasoning ability remain unclear. To
this end, we introduce a new commonsense reasoning benchmark dataset comprising
natural language true/false statements, with each sample paired with its
complementary counterpart, resulting in 4k sentence pairs. We propose a
pairwise accuracy metric to reliably measure an agent's ability to perform
commonsense reasoning over a given situation. The dataset is crowdsourced and
enhanced with an adversarial model-in-the-loop setup to incentivize challenging
samples. To facilitate a systematic analysis of commonsense capabilities, we
design our dataset along the dimensions of knowledge domains, reasoning
scenarios and numeracy. Experimental results demonstrate that our strongest
baseline (UnifiedQA-3B), after fine-tuning, achieves ~71% standard accuracy and
~51% pairwise accuracy, well below human performance (~95% for both metrics).
The dataset is available at https://github.com/PlusLabNLP/Com2Sense.
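Since pairs are complementary, the pairwise accuracy metric described above credits a model only when it labels both statements of a pair correctly, which penalizes models that guess one side of a situation. A minimal sketch of the two metrics, assuming each pair is a tuple of two statements with boolean gold labels (function and variable names are illustrative, not from the released codebase):

```python
# Hedged sketch: standard vs. pairwise accuracy over complementary
# true/false statement pairs. A pair counts as correct for pairwise
# accuracy only when BOTH of its statements are labeled correctly.

def standard_accuracy(preds, golds):
    """Fraction of individual statements labeled correctly."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

def pairwise_accuracy(pair_preds, pair_golds):
    """Fraction of complementary pairs with both statements correct."""
    correct = sum(pa == ga and pb == gb
                  for (pa, pb), (ga, gb) in zip(pair_preds, pair_golds))
    return correct / len(pair_golds)

# Toy example: two pairs; the model gets both statements of the first
# pair right but only one statement of the second pair right.
pair_golds = [(True, False), (True, False)]
pair_preds = [(True, False), (True, True)]
flatten = lambda pairs: [x for pair in pairs for x in pair]
print(standard_accuracy(flatten(pair_preds), flatten(pair_golds)))  # 0.75
print(pairwise_accuracy(pair_preds, pair_golds))                    # 0.5
```

This gap between the two toy scores mirrors the reported baseline results, where standard accuracy (~71%) substantially exceeds pairwise accuracy (~51%).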
Related papers
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
- Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data.
The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers.
To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z)
- Advancing Transformer's Capabilities in Commonsense Reasoning [6.5798066703568105]
We apply current ML-based methods to improve general-purpose pre-trained language models on the task of commonsense reasoning.
Our best model outperforms the strongest previous works by 15% absolute gains in Pairwise Accuracy and 8.7% absolute gains in Standard Accuracy.
arXiv Detail & Related papers (2023-10-10T17:21:03Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Elaborating on the robustness metric, a model is judged robust if its performance is consistently accurate across all cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Is Synthetic Dataset Reliable for Benchmarking Generalizable Person Re-Identification? [1.1041211464412568]
We show that a recent large-scale synthetic dataset ClonedPerson can be reliably used to benchmark GPReID, statistically the same as real-world datasets.
This study validates the usage of synthetic datasets for both the source training set and the target testing set, with no privacy concerns arising from real-world surveillance data.
arXiv Detail & Related papers (2022-09-12T06:54:54Z)
- Combining human parsing with analytical feature extraction and ranking schemes for high-generalization person reidentification [0.0]
Person reidentification (re-ID) has been receiving increasing attention in recent years due to its importance for both science and society.
Machine learning, and particularly deep learning (DL), has become the main re-ID tool, allowing researchers to achieve unprecedented accuracy levels on benchmark datasets.
We present a model without trainable parameters which shows great potential for high generalization.
arXiv Detail & Related papers (2022-07-28T17:22:48Z)
- Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address these problems via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning.
arXiv Detail & Related papers (2022-03-14T20:13:21Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.