Cross-functional Analysis of Generalisation in Behavioural Learning
- URL: http://arxiv.org/abs/2305.12951v1
- Date: Mon, 22 May 2023 11:54:19 GMT
- Title: Cross-functional Analysis of Generalisation in Behavioural Learning
- Authors: Pedro Henrique Luz de Araujo and Benjamin Roth
- Abstract summary: We introduce BeLUGA, an analysis method for evaluating behavioural learning with respect to generalisation across dimensions of different granularity levels.
An aggregate score measures generalisation to unseen functionalities (or overfitting).
- Score: 4.0810783261728565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In behavioural testing, system functionalities underrepresented in the
standard evaluation setting (with a held-out test set) are validated through
controlled input-output pairs. Optimising performance on the behavioural tests
during training (behavioural learning) would improve coverage of phenomena not
sufficiently represented in the i.i.d. data and could lead to seemingly more
robust models. However, there is the risk that the model narrowly captures
spurious correlations from the behavioural test suite, leading to
overestimation and misrepresentation of model performance -- one of the
original pitfalls of traditional evaluation. In this work, we introduce BeLUGA,
an analysis method for evaluating behavioural learning considering
generalisation across dimensions of different granularity levels. We optimise
behaviour-specific loss functions and evaluate models on several partitions of
the behavioural test suite controlled to leave out specific phenomena. An
aggregate score measures generalisation to unseen functionalities (or
overfitting). We use BeLUGA to examine three representative NLP tasks
(sentiment analysis, paraphrase identification and reading comprehension) and
compare the impact of a diverse set of regularisation and domain generalisation
methods on generalisation performance.
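The hold-out evaluation protocol described in the abstract can be sketched as follows. This is a minimal illustration of training on a behavioural test suite with one functionality held out and aggregating accuracy on the unseen functionalities; all names and the aggregation scheme are assumptions for illustration, not the authors' implementation:

```python
# Minimal sketch of cross-functional hold-out evaluation in the spirit of
# BeLUGA: for each functionality, train on the rest of the suite and measure
# accuracy on the held-out one. Illustrative only, not the paper's code.

def evaluate(model, cases):
    """Accuracy of `model` (a callable) on (input, label) pairs."""
    correct = sum(1 for x, y in cases if model(x) == y)
    return correct / len(cases)

def cross_functional_scores(train_and_eval, suite):
    """For each functionality, train on the others and test on the held-out one.

    `suite` maps functionality name -> list of (input, label) pairs.
    `train_and_eval` trains a fresh model on the given cases and returns it.
    """
    scores = {}
    for held_out, held_cases in suite.items():
        train_cases = [c for f, cs in suite.items() if f != held_out for c in cs]
        model = train_and_eval(train_cases)
        scores[held_out] = evaluate(model, held_cases)
    return scores

def aggregate(scores):
    """Mean accuracy on unseen functionalities; a large gap to in-suite
    accuracy would indicate overfitting to the behavioural tests."""
    return sum(scores.values()) / len(scores)
```

With a toy two-functionality suite and a trivial constant classifier, `cross_functional_scores` returns one held-out accuracy per functionality, and `aggregate` collapses them into a single generalisation score.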
Related papers
- An Auditing Test To Detect Behavioral Shift in Language Models [28.52295230939529]
We present a method for continual Behavioral Shift Auditing (BSA) in language models.
BSA detects behavioral shifts solely through model generations.
We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
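The idea of detecting a behavioural shift from generations alone, with only hundreds of examples, can be illustrated with a plain two-proportion z-test on the rate of some target behaviour. This is a deliberately simplified stand-in for intuition, not the sequential auditing test proposed in the paper, and all counts are made up:

```python
# Toy illustration of behavioural shift detection from model generations:
# compare the rate of a target behaviour (e.g. refusals) observed in samples
# from a reference model vs. a deployed model with a two-proportion z-test.
# A simplification for intuition, not the BSA test from the paper.
from math import sqrt, erfc

def two_proportion_z(k1, n1, k2, n2):
    """Return (z, two-sided p-value) for H0: equal behaviour rates."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)          # pooled rate under H0
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))   # two-sided normal tail probability
    return z, p_value

# A few hundred generations per model can expose a clear shift,
# here from a 10% to a 25% behaviour rate.
z, p = two_proportion_z(30, 300, 75, 300)
shifted = p < 0.05
```

The design choice here is that only model outputs are inspected, matching the constraint that BSA detects shifts solely through generations.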
arXiv Detail & Related papers (2024-10-25T09:09:31Z) - Generalization Ability of Feature-based Performance Prediction Models: A Statistical Analysis across Benchmarks [5.170967632369504]
We compare the statistical similarity between the problem collections with the accuracy of performance prediction models based on exploratory landscape analysis features.
We observe that there is a positive correlation between these two measures.
Specifically, when the differences between the high-dimensional feature value distributions of the training and testing suites are not statistically significant, the model tends to generalize well.
arXiv Detail & Related papers (2024-05-20T12:39:24Z) - Preserving Silent Features for Domain Generalization [6.568921669414849]
Self-supervised contrastive learning pre-trained models do not exhibit better generalization performance than supervised models pre-trained on the same dataset in the DG setting.
We propose a simple yet effective method termed STEP (Silent Feature Preservation) to improve the generalization performance of the self-supervised contrastive learning pre-trained model.
arXiv Detail & Related papers (2024-01-06T09:11:41Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Assessing the Generalizability of a Performance Predictive Model [0.6070952062639761]
We propose a workflow to estimate the generalizability of a predictive model for algorithm performance.
The results show that generalizability patterns in the landscape feature space are reflected in the performance space.
arXiv Detail & Related papers (2023-05-31T12:50:44Z) - Modeling Uncertain Feature Representation for Domain Generalization [49.129544670700525]
We show that our method consistently improves the network generalization ability on multiple vision tasks.
Our methods are simple yet effective and can be readily integrated into networks without additional trainable parameters or loss constraints.
arXiv Detail & Related papers (2023-01-16T14:25:02Z) - SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in Fine-tuned Source Code Models [58.78043959556283]
We study the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods.
Our analysis uncovers that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.
arXiv Detail & Related papers (2022-10-10T16:07:24Z) - Checking HateCheck: a cross-functional analysis of behaviour-aware learning for hate speech detection [4.0810783261728565]
We investigate fine-tuning schemes using HateCheck, a suite of functional tests for hate speech detection systems.
We train and evaluate models on different configurations of HateCheck by holding out categories of test cases.
The fine-tuning procedure led to improvements in the classification accuracy of held-out functionalities and identity groups.
However, performance on held-out functionality classes and i.i.d. hate speech detection data decreased, which indicates that generalisation occurs mostly across functionalities from the same class.
arXiv Detail & Related papers (2022-04-08T13:03:01Z) - Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z) - CASTLE: Regularization via Auxiliary Causal Graph Discovery [89.74800176981842]
We introduce Causal Structure Learning (CASTLE) regularization and propose to regularize a neural network by jointly learning the causal relationships between variables.
CASTLE efficiently reconstructs only the features in the causal DAG that have a causal neighbor, whereas reconstruction-based regularizers suboptimally reconstruct all input features.
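The contrast between a standard reconstruction-based regulariser and CASTLE's DAG-restricted reconstruction can be sketched as a pure-Python toy. The hand-written adjacency matrix and the squared-error loss form are assumptions for illustration; in CASTLE the causal graph is learned jointly with the network:

```python
# Toy contrast: a standard reconstruction regulariser penalises error on all
# input features, while a CASTLE-style regulariser reconstructs only features
# that have at least one causal neighbour (parent or child) in a DAG.

def squared_errors(x, x_hat):
    return [(a - b) ** 2 for a, b in zip(x, x_hat)]

def full_reconstruction_loss(x, x_hat):
    """Reconstruct every feature, including causally isolated ones."""
    return sum(squared_errors(x, x_hat))

def dag_reconstruction_loss(x, x_hat, adjacency):
    """Reconstruct only features with a causal neighbour.

    `adjacency[i][j] == 1` means feature i is a parent of feature j.
    """
    n = len(x)
    has_neighbour = [
        any(adjacency[i][j] or adjacency[j][i] for j in range(n) if j != i)
        for i in range(n)
    ]
    errs = squared_errors(x, x_hat)
    return sum(e for e, keep in zip(errs, has_neighbour) if keep)

# Feature 2 is isolated (no parents or children), so its reconstruction
# error is excluded from the CASTLE-style penalty.
adj = [[0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]]
x, x_hat = [1.0, 2.0, 3.0], [1.0, 1.0, 0.0]
full = full_reconstruction_loss(x, x_hat)        # penalises all three features
masked = dag_reconstruction_loss(x, x_hat, adj)  # skips the isolated feature
```

The masked loss ignores the large error on the isolated feature, which is the sense in which reconstructing all input features is suboptimal.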
arXiv Detail & Related papers (2020-09-28T09:49:38Z) - Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that involves a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.