RICA: Evaluating Robust Inference Capabilities Based on Commonsense
Axioms
- URL: http://arxiv.org/abs/2005.00782v4
- Date: Fri, 10 Sep 2021 01:37:12 GMT
- Title: RICA: Evaluating Robust Inference Capabilities Based on Commonsense
Axioms
- Authors: Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay
Pujara, Xiang Ren
- Abstract summary: We propose a new challenge, RICA: Robust Inference capability based on Commonsense Axioms.
We generate data for this challenge using commonsense knowledge bases and probe PTLMs across two different evaluation settings.
Experiments show that PTLMs perform no better than random guessing in the zero-shot setting, are heavily impacted by statistical biases, and are not robust to perturbation attacks.
- Score: 41.82685006832153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (PTLMs) have achieved impressive performance on
commonsense inference benchmarks, but their ability to employ commonsense to
make robust inferences, which is crucial for effective communications with
humans, is debated. In the pursuit of advancing fluid human-AI communication,
we propose a new challenge, RICA: Robust Inference capability based on
Commonsense Axioms, that evaluates robust commonsense inference despite textual
perturbations. To generate data for this challenge, we develop a systematic and
scalable procedure using commonsense knowledge bases and probe PTLMs across two
different evaluation settings. Extensive experiments on our generated probe
sets with more than 10k statements show that PTLMs perform no better than
random guessing in the zero-shot setting, are heavily impacted by statistical
biases, and are not robust to perturbation attacks. We also find that
fine-tuning on similar statements offers limited gains, as PTLMs still fail to
generalize to unseen inferences. Our new large-scale benchmark exposes a
significant gap between PTLMs and human-level language understanding and offers
a new challenge for PTLMs to demonstrate commonsense.
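Below is a minimal, illustrative sketch of the kind of zero-shot probing described in the abstract: a masked language model must choose between two contrastive fillers for a commonsense-axiom statement. This is not the authors' released code; the model name, probe sentence, and candidate words are assumptions for illustration only.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Any masked LM works here; roberta-base is an arbitrary choice for the sketch.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def choose_filler(statement, candidates):
    """Return whichever candidate word the masked LM scores higher at the [MASK] slot."""
    text = statement.replace("[MASK]", tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    # Score each candidate by the logit of its first subword (with a leading space).
    scores = {
        word: logits[tokenizer(" " + word, add_special_tokens=False)["input_ids"][0]].item()
        for word in candidates
    }
    return max(scores, key=scores.get)

# Hypothetical probe with made-up entities; a robust model should answer "less" here
# and keep answering correctly under textual perturbations of the same statement.
probe = ("A prindag is smaller than a flurberg, "
         "so a prindag is [MASK] likely to be able to contain a flurberg.")
print(choose_filler(probe, ("more", "less")))
```

Per the abstract, what is evaluated is not a single answer but consistency: the same axiom is rendered in many logically equivalent, perturbed surface forms, and a robust model should answer all of them the same way.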
Related papers
- Towards Robust LLMs: an Adversarial Robustness Measurement Framework [0.0]
Large Language Models (LLMs) remain vulnerable to adversarial perturbations, undermining their reliability in high-stakes applications.
We adapt the Robustness Measurement and Assessment framework to quantify LLM resilience against adversarial inputs without requiring access to model parameters.
Our work provides a systematic methodology to assess LLM robustness, advancing the development of more reliable language models for real-world deployment.
arXiv Detail & Related papers (2025-04-24T16:36:19Z) - A Debate-Driven Experiment on LLM Hallucinations and Accuracy [7.821303946741665]
This study investigates the phenomenon of hallucination in large language models (LLMs).
Multiple instances of GPT-4o-Mini models engage in a debate-like interaction prompted with questions from the TruthfulQA dataset.
One model is deliberately instructed to generate plausible but false answers while the other models are asked to respond truthfully.
arXiv Detail & Related papers (2024-10-25T11:41:27Z) - RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models [12.112914393948415]
We present RUPBench, a benchmark designed to evaluate large language models (LLMs) across diverse reasoning tasks.
Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning.
By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.
arXiv Detail & Related papers (2024-06-16T17:26:44Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.
It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal how comprehensively language models understand the questions they are asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Exploring the Physical World Adversarial Robustness of Vehicle Detection [13.588120545886229]
Adversarial attacks can compromise the robustness of real-world detection models.
We propose an innovative instant-level data generation pipeline using the CARLA simulator.
Our findings highlight diverse model performances under adversarial conditions.
arXiv Detail & Related papers (2023-08-07T11:09:12Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Fair Robust Active Learning by Joint Inconsistency [22.150782414035422]
We introduce a novel task, Fair Robust Active Learning (FRAL), integrating conventional FAL and adversarial robustness.
We develop a simple yet effective FRAL strategy by Joint INconsistency (JIN).
Our method exploits the prediction inconsistency between benign and adversarial samples as well as between standard and robust models.
arXiv Detail & Related papers (2022-09-22T01:56:41Z) - Evaluate Confidence Instead of Perplexity for Zero-shot Commonsense
Reasoning [85.1541170468617]
This paper reconsiders the nature of commonsense reasoning and proposes a novel commonsense reasoning metric, Non-Replacement Confidence (NRC).
Our proposed novel method boosts zero-shot performance on two commonsense reasoning benchmark datasets and further seven commonsense question-answering datasets.
arXiv Detail & Related papers (2022-08-23T14:42:14Z) - Characterizing the adversarial vulnerability of speech self-supervised
learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of this paradigm under attacks from both zero-knowledge and limited-knowledge adversaries.
The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z) - A Simple but Tough-to-Beat Data Augmentation Approach for Natural
Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff.
cutoff relies on sampling consistency and thus adds little computational overhead.
cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
arXiv Detail & Related papers (2020-09-29T07:08:35Z)
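For the cutoff entry above, here is a minimal sketch of token-level cutoff augmentation with a consistency loss, reconstructed only from the abstract (not the authors' implementation). The model is assumed to be a HuggingFace-style sequence classifier; the names and hyperparameters (cut_ratio, n_views, alpha) are illustrative.

```python
import torch
import torch.nn.functional as F

def token_cutoff(attention_mask, cut_ratio=0.15):
    """Zero out one random contiguous span of each example's attention mask."""
    mask = attention_mask.clone()
    for i, row in enumerate(mask):
        length = int(row.sum().item())
        cut_len = max(1, int(length * cut_ratio))
        start = torch.randint(0, max(1, length - cut_len), (1,)).item()
        mask[i, start:start + cut_len] = 0
    return mask

def cutoff_loss(model, input_ids, attention_mask, labels, n_views=2, alpha=1.0):
    """Cross-entropy on each augmented view plus a consistency term across views."""
    probs, ce = [], 0.0
    for _ in range(n_views):
        view_mask = token_cutoff(attention_mask)
        logits = model(input_ids=input_ids, attention_mask=view_mask).logits
        ce = ce + F.cross_entropy(logits, labels)
        probs.append(F.softmax(logits, dim=-1))
    avg = torch.stack(probs).mean(dim=0)
    # Mean KL(p_i || p_avg): penalizes views that disagree with the average prediction.
    consistency = sum(F.kl_div(avg.log(), p, reduction="batchmean") for p in probs) / n_views
    return ce / n_views + alpha * consistency
```

The consistency term is what the abstract calls "sampling consistency": all augmented views come from the same forward pass setup, so the regularizer adds little overhead beyond the extra views.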