Do Generalisation Results Generalise?
- URL: http://arxiv.org/abs/2512.07832v1
- Date: Mon, 08 Dec 2025 18:59:51 GMT
- Title: Do Generalisation Results Generalise?
- Authors: Matteo Boglioni, Andrea Sgobbi, Gabriel Tavernini, Francesco Rita, Marius Mosbach, Tiago Pimentel
- Abstract summary: We evaluate a model's performance across multiple OOD testsets throughout a finetuning run. We then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results.
- Score: 19.855708462203097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD testsets throughout a finetuning run; we then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated generalisation performances are once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
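The measurement the abstract describes can be sketched in a few lines: record in-domain and OOD accuracies at each finetuning checkpoint, regress each OOD series on the in-domain series, and correlate the residuals. The sketch below is minimal and illustrative; the checkpoint count and accuracy values are invented for the example and are not taken from the paper.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Pearson correlation between x and y after regressing out z from both."""
    def residuals(a, control):
        # Ordinary least squares of a on [1, control]; keep what is left over.
        design = np.column_stack([np.ones_like(control), control])
        coef, *_ = np.linalg.lstsq(design, a, rcond=None)
        return a - design @ coef
    return np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

# Illustrative accuracies at successive finetuning checkpoints (made-up numbers).
in_domain = np.array([0.52, 0.61, 0.68, 0.73, 0.77, 0.80])
ood_a = np.array([0.40, 0.47, 0.50, 0.55, 0.56, 0.58])
ood_b = np.array([0.45, 0.44, 0.48, 0.47, 0.50, 0.49])

# Positive value: the two OOD testsets improve together beyond what in-domain
# progress alone explains; negative value: they trade off against each other.
print(partial_correlation(ood_a, ood_b, in_domain))
```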
Related papers
- Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering [4.123456708238846]
A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. We challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models. We find that the different datasets used for OOD evaluations in QA provide estimates of models' robustness to shortcuts of vastly different quality, with some largely under-performing even a simple, in-distribution evaluation.
arXiv Detail & Related papers (2025-08-25T18:49:50Z) - Can Interpretation Predict Behavior on Unseen Data? [11.280404893713213]
Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution model behavior.
arXiv Detail & Related papers (2025-07-08T23:07:33Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis [36.689210473887904]
We introduce a benchmarking framework for evaluating cross-dataset prediction generalization in deep learning (DL) and machine learning (ML) models. We quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., the performance drop compared to within-dataset results). Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments.
arXiv Detail & Related papers (2025-03-18T15:40:18Z) - Negative as Positive: Enhancing Out-of-distribution Generalization for Graph Contrastive Learning [60.61079931266331]
We propose a novel strategy, "Negative as Positive", in which the most semantically similar cross-domain negative pairs are treated as positive during graph contrastive learning (GCL).
Our experimental results, spanning a wide array of datasets, confirm that this method substantially improves the OOD generalization performance of GCL.
arXiv Detail & Related papers (2024-05-25T13:29:31Z) - Principles from Clinical Research for NLP Model Generalization [10.985226652193543]
We explore the foundations of generalizability and study the factors that affect it.
We demonstrate how learning spurious correlations, such as the distance between entities in relation extraction tasks, can affect a model's internal validity.
arXiv Detail & Related papers (2023-11-07T02:17:25Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Cross-functional Analysis of Generalisation in Behavioural Learning [4.0810783261728565]
We introduce BeLUGA, an analysis method for evaluating behavioural learning considering generalisation across dimensions of different levels.
An aggregate score measures generalisation to unseen functionalities (or overfitting).
arXiv Detail & Related papers (2023-05-22T11:54:19Z) - Guide the Learner: Controlling Product of Experts Debiasing Method Based on Token Attribution Similarities [17.082695183953486]
A popular workaround is to train a robust model by re-weighting training examples based on a secondary biased model.
Here, the underlying assumption is that the biased model resorts to shortcut features.
We introduce a fine-tuning strategy that incorporates the similarity between the main and biased model attribution scores in a Product of Experts loss function.
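For readers unfamiliar with Product of Experts (PoE) debiasing, a minimal sketch of the standard loss is shown below. The attribution-similarity term described in this summary is represented only by a hypothetical per-example weight `bias_weight`, since its exact formulation is not given here; this is an illustrative sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def poe_loss(main_logits, bias_logits, labels, bias_weight=None):
    """Product of Experts debiasing loss (sketch).

    The main and biased models' log-probabilities are summed before the
    cross-entropy, so examples the biased model already solves via shortcuts
    contribute little gradient to the main model. `bias_weight` is a
    hypothetical per-example scalar standing in for the token-attribution
    similarity described in the abstract.
    """
    bias_log_probs = F.log_softmax(bias_logits.detach(), dim=-1)
    if bias_weight is not None:
        bias_log_probs = bias_weight.unsqueeze(-1) * bias_log_probs
    combined = F.log_softmax(main_logits, dim=-1) + bias_log_probs
    # cross_entropy renormalises its input, so passing summed log-probabilities
    # implements softmax(log p_main + log p_bias), i.e. a product of experts.
    return F.cross_entropy(combined, labels)

# Example call with random tensors (illustrative shapes only).
main_logits = torch.randn(8, 3, requires_grad=True)
bias_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
poe_loss(main_logits, bias_logits, labels).backward()
```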
arXiv Detail & Related papers (2023-02-06T15:21:41Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z) - Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z) - Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that involves a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)