RICA: Evaluating Robust Inference Capabilities Based on Commonsense
Axioms
- URL: http://arxiv.org/abs/2005.00782v4
- Date: Fri, 10 Sep 2021 01:37:12 GMT
- Title: RICA: Evaluating Robust Inference Capabilities Based on Commonsense
Axioms
- Authors: Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay
Pujara, Xiang Ren
- Abstract summary: We propose a new challenge, RICA: Robust Inference capability based on Commonsense Axioms.
We generate data for this challenge using commonsense knowledge bases and probe PTLMs across two different evaluation settings.
Experiments show that PTLMs perform no better than random guessing on the zero-shot setting, are heavily impacted by statistical biases, and are not robust to perturbation attacks.
- Score: 41.82685006832153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (PTLMs) have achieved impressive performance on
commonsense inference benchmarks, but their ability to employ commonsense to
make robust inferences, which is crucial for effective communications with
humans, is debated. In the pursuit of advancing fluid human-AI communication,
we propose a new challenge, RICA: Robust Inference capability based on
Commonsense Axioms, that evaluates robust commonsense inference despite textual
perturbations. To generate data for this challenge, we develop a systematic and
scalable procedure using commonsense knowledge bases and probe PTLMs across two
different evaluation settings. Extensive experiments on our generated probe
sets with more than 10k statements show that PTLMs perform no better than
random guessing on the zero-shot setting, are heavily impacted by statistical
biases, and are not robust to perturbation attacks. We also find that
fine-tuning on similar statements offers limited gains, as PTLMs still fail to
generalize to unseen inferences. Our new large-scale benchmark exposes a
significant gap between PTLMs and human-level language understanding and offers
a new challenge for PTLMs to demonstrate commonsense.
Related papers
- CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models [5.409370027524351]
We evaluate the performance of large language models (LLMs) in counterfactual reasoning.
We introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions.
arXiv Detail & Related papers (2025-02-16T06:19:37Z)
- RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models [12.112914393948415]
We present RUPBench, a benchmark designed to evaluate large language models (LLMs) across diverse reasoning tasks.
Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning.
By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.
arXiv Detail & Related papers (2024-06-16T17:26:44Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Exploring the Physical World Adversarial Robustness of Vehicle Detection [13.588120545886229]
Adversarial attacks can compromise the robustness of real-world detection models.
We propose an innovative instant-level data generation pipeline using the CARLA simulator.
Our findings highlight diverse model performances under adversarial conditions.
arXiv Detail & Related papers (2023-08-07T11:09:12Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Fair Robust Active Learning by Joint Inconsistency [22.150782414035422]
We introduce a novel task, Fair Robust Active Learning (FRAL), integrating conventional FAL and adversarial robustness.
We develop a simple yet effective FRAL strategy by Joint INconsistency (JIN).
Our method exploits the prediction inconsistency between benign and adversarial samples as well as between standard and robust models.
arXiv Detail & Related papers (2022-09-22T01:56:41Z)
- Evaluate Confidence Instead of Perplexity for Zero-shot Commonsense Reasoning [85.1541170468617]
This paper reconsiders the nature of commonsense reasoning and proposes a novel commonsense reasoning metric, Non-Replacement Confidence (NRC).
Our proposed method boosts zero-shot performance on two commonsense reasoning benchmark datasets and on seven further commonsense question-answering datasets.
arXiv Detail & Related papers (2022-08-23T14:42:14Z)
- Characterizing the adversarial vulnerability of speech self-supervised learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of such a paradigm under attacks from both zero-knowledge and limited-knowledge adversaries.
The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z)
- A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff.
cutoff relies on sampling consistency and thus adds little computational overhead.
cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
arXiv Detail & Related papers (2020-09-29T07:08:35Z)