A practical generalization metric for deep networks benchmarking
- URL: http://arxiv.org/abs/2409.01498v1
- Date: Mon, 2 Sep 2024 23:38:25 GMT
- Title: A practical generalization metric for deep networks benchmarking
- Authors: Mengqing Huang, Hongchuan Yu, Jianjun Zhang
- Abstract summary: This paper introduces a practical generalization metric for benchmarking different deep networks and proposes a novel testbed for the verification of theoretical estimations.
Our findings indicate that a deep network's generalization capacity in classification tasks is contingent upon both classification accuracy and the diversity of unseen data.
It is discouraging to note that most of the available generalization estimations do not correlate with the practical measurements obtained using our proposed practical metric.
- Score: 4.111474233685893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an ongoing and dedicated effort to estimate bounds on the generalization error of deep learning models, coupled with an increasing interest in practical metrics that can be used to experimentally evaluate a model's ability to generalize. This interest is not only driven by practical considerations but is also vital for theoretical research, as theoretical estimations require practical validation. However, there is currently a lack of research on benchmarking the generalization capacity of various deep networks and verifying these theoretical estimations. This paper introduces a practical generalization metric for benchmarking different deep networks and proposes a novel testbed for the verification of theoretical estimations. Our findings indicate that a deep network's generalization capacity in classification tasks is contingent upon both classification accuracy and the diversity of unseen data. The proposed metric system quantifies both the accuracy of deep learning models and the diversity of data, providing an intuitive, quantitative evaluation method in the form of a trade-off point. Furthermore, we compare our practical metric with existing theoretical generalization estimations using our benchmarking testbed. It is discouraging to note that most of the available generalization estimations do not correlate with the practical measurements obtained using our proposed metric. On the other hand, this finding is significant as it exposes the shortcomings of theoretical estimations and inspires new exploration.
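The accuracy/diversity trade-off described in the abstract can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual metric: the `entropy_diversity` proxy and the `alpha` weighting are assumptions introduced here for clarity.

```python
from collections import Counter
from math import log

def entropy_diversity(labels):
    """Shannon entropy of the label distribution: a simple proxy for the
    diversity of an unseen test set (an assumption, not the paper's
    actual diversity measure)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

def generalization_score(accuracy, test_labels, alpha=0.5):
    """Hypothetical trade-off score combining classification accuracy
    with the diversity of the unseen data it was measured on."""
    return alpha * accuracy + (1 - alpha) * entropy_diversity(test_labels)

# Example: 90% accuracy on a balanced 4-class test set
score = generalization_score(0.9, [0, 1, 2, 3] * 25)
```

Under this sketch, the same accuracy measured on a more diverse test set yields a higher score, matching the abstract's claim that generalization capacity depends on both quantities.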
Related papers
- Takeuchi's Information Criteria as Generalization Measures for DNNs Close to NTK Regime [56.89793618576349]
Generalization measures have been studied extensively in the machine learning community to better characterize generalization gaps. This study focuses on Takeuchi's information criterion (TIC) to investigate the conditions under which this classical measure can effectively explain the generalization gaps of deep neural networks (DNNs).
arXiv Detail & Related papers (2026-02-26T17:01:14Z) - Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs [2.98033672654447]
We argue that many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence. We introduce the Trace Card, a lightweight documentation artifact designed to accompany socio-cognitive evaluations.
arXiv Detail & Related papers (2026-01-05T08:06:50Z) - The Benchmarking Epistemology: Construct Validity for Evaluating Machine Learning Models [1.1315617886931963]
We develop conditions of construct validity inspired by psychological measurement theory. We examine these assumptions in practice through three case studies. Our framework clarifies conditions under which benchmark scores can support diverse scientific claims.
arXiv Detail & Related papers (2025-10-27T10:30:30Z) - Manifold Dimension Estimation: An Empirical Study [0.0]
The manifold hypothesis suggests that high-dimensional data often lie on or near a low-dimensional manifold. Estimating the dimension of this manifold is essential for leveraging its structure. This article provides a comprehensive survey for both researchers and practitioners.
arXiv Detail & Related papers (2025-09-19T01:48:58Z) - The Shape of Generalization through the Lens of Norm-based Capacity Control [20.88908358215574]
We consider norm-based capacity measures and develop our study for random-features-based estimators. We provide a precise characterization of how the estimator's norm concentrates and how it governs the associated test error. This confirms that the more classical U-shaped behavior is recovered when considering appropriate capacity measures based on model norms rather than size.
arXiv Detail & Related papers (2025-02-03T18:10:40Z) - PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines [86.36060279469304]
We introduce PredBench, a benchmark tailored for the holistic evaluation of spatio-temporal prediction networks.
This benchmark integrates 12 widely adopted methods with diverse datasets across multiple application domains.
Its multi-dimensional evaluation framework broadens the analysis with a comprehensive set of metrics.
arXiv Detail & Related papers (2024-07-11T11:51:36Z) - Empirical Tests of Optimization Assumptions in Deep Learning [41.05664717242051]
This paper develops new empirical metrics to track the key quantities that must be controlled in theoretical analysis.
All of our tested assumptions fail to reliably capture optimization performance.
This highlights a need for new empirical verification of analytical assumptions used in theoretical analysis.
arXiv Detail & Related papers (2024-07-01T21:56:54Z) - Calibration-then-Calculation: A Variance Reduced Metric Framework in Deep Click-Through Rate Prediction Models [16.308958212406583]
There is a lack of focus on evaluating the performance of deep learning pipelines.
With the increased use of large datasets and complex models, the training process is run only once and the result is compared to previous benchmarks.
Traditional solutions, such as running the training process multiple times, are often infeasible due to computational constraints.
We introduce a novel metric framework, the Calibrated Loss Metric, designed to address this issue by reducing the variance present in its conventional counterpart.
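The variance-reduction idea behind a calibrated loss can be sketched as follows (a hypothetical illustration under assumed details, not the paper's actual Calibrated Loss Metric): rescale predicted probabilities so their mean matches the observed positive rate before computing the log loss, which removes variance due to miscalibrated prediction scale across training runs.

```python
from math import log

def log_loss(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy, clipped away from log(0)."""
    return -sum(
        y * log(max(p, eps)) + (1 - y) * log(max(1 - p, eps))
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)

def calibrated_log_loss(y_true, y_prob):
    """Hypothetical calibration step: rescale predictions so their mean
    matches the observed positive rate, then compute the log loss."""
    base_rate = sum(y_true) / len(y_true)
    mean_pred = sum(y_prob) / len(y_prob)
    scale = base_rate / mean_pred if mean_pred > 0 else 1.0
    calibrated = [min(max(p * scale, 0.0), 1.0) for p in y_prob]
    return log_loss(y_true, calibrated)
```

Note how, in this sketch, two runs whose predictions differ only by a multiplicative miscalibration receive identical calibrated losses, which is the sense in which the metric's variance is reduced.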
arXiv Detail & Related papers (2024-01-30T02:38:23Z) - A Theoretical and Practical Framework for Evaluating Uncertainty Calibration in Object Detection [1.8843687952462744]
This work presents a novel theoretical and practical framework to evaluate object detection systems in the context of uncertainty calibration.
The robustness of the proposed uncertainty calibration metrics is shown through a series of representative experiments.
arXiv Detail & Related papers (2023-09-01T14:02:44Z) - Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z) - A Theoretical-Empirical Approach to Estimating Sample Complexity of DNNs [11.152761263415046]
This paper focuses on understanding how the generalization error scales with the amount of training data for deep neural networks (DNNs).
We derive estimates of the generalization error that hold for deep networks and do not rely on unattainable capacity measures.
arXiv Detail & Related papers (2021-05-05T05:14:08Z) - Metrics and continuity in reinforcement learning [34.10996560464196]
We introduce a unified formalism for defining topologies through the lens of metrics.
We establish a hierarchy amongst these metrics and demonstrate their theoretical implications on the Markov Decision Process.
We complement our theoretical results with empirical evaluations showcasing the differences between the metrics considered.
arXiv Detail & Related papers (2021-02-02T14:30:41Z) - Margin-Based Transfer Bounds for Meta Learning with Deep Feature Embedding [67.09827634481712]
We leverage margin theory and statistical learning theory to establish three margin-based transfer bounds for meta-learning-based multiclass classification (MLMC).
These bounds reveal that the expected error of a given classification algorithm for a future task can be estimated with the average empirical error on a finite number of previous tasks.
Experiments on three benchmarks show that these margin-based models still achieve competitive performance.
arXiv Detail & Related papers (2020-12-02T23:50:51Z) - In Search of Robust Measures of Generalization [79.75709926309703]
We develop bounds on generalization error, optimization error, and excess risk.
When evaluated empirically, most of these bounds are numerically vacuous.
We argue that generalization measures should instead be evaluated within the framework of distributional robustness.
arXiv Detail & Related papers (2020-10-22T17:54:25Z) - Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z) - On the uncertainty of self-supervised monocular depth estimation [52.13311094743952]
Self-supervised paradigms for monocular depth estimation are very appealing since they do not require ground truth annotations at all.
We explore for the first time how to estimate the uncertainty for this task and how this affects depth accuracy.
We propose a novel technique specifically designed for self-supervised approaches.
arXiv Detail & Related papers (2020-05-13T09:00:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.