Related papers: Beyond ImageNet: Understanding Cross-Dataset Robustness of Lightweight Vision Models

URL: http://arxiv.org/abs/2511.00335v1
Date: Sat, 01 Nov 2025 00:40:06 GMT
Title: Beyond ImageNet: Understanding Cross-Dataset Robustness of Lightweight Vision Models
Authors: Weidong Zhang, Pak Lun Kevin Ding, Huan Liu
Abstract summary: We present the first systematic evaluation of 11 lightweight vision models (2.5M parameters) trained under a fixed 100-epoch schedule across 7 diverse datasets. We introduce the Cross-Dataset Score (xScore), a unified metric that quantifies the consistency and robustness of model performance across diverse visual domains.
Score: 13.660350750023055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Lightweight vision classification models such as MobileNet, ShuffleNet, and EfficientNet are increasingly deployed in mobile and embedded systems, yet their performance has been predominantly benchmarked on ImageNet. This raises critical questions: Do models that excel on ImageNet also generalize across other domains? How can cross-dataset robustness be systematically quantified? And which architectural elements consistently drive generalization under tight resource constraints? Here, we present the first systematic evaluation of 11 lightweight vision models (2.5M parameters), trained under a fixed 100-epoch schedule across 7 diverse datasets. We introduce the Cross-Dataset Score (xScore), a unified metric that quantifies the consistency and robustness of model performance across diverse visual domains. Our results show that (1) ImageNet accuracy does not reliably predict performance on fine-grained or medical datasets, (2) xScore provides a scalable predictor of mobile model performance that can be estimated from just four datasets, and (3) certain architectural components--such as isotropic convolutions with higher spatial resolution and channel-wise attention--promote broader generalization, while Transformer-based blocks yield little additional benefit, despite incurring higher parameter overhead. This study provides a reproducible framework for evaluating lightweight vision models beyond ImageNet, highlights key design principles for mobile-friendly architectures, and guides the development of future models that generalize robustly across diverse application domains.

Related papers

ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters [67.87703790962388]
We introduce ScaleNet, an efficient approach for scaling vision transformers (ViTs). Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters. We show that ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs.
arXiv Detail & Related papers (2025-10-21T09:07:25Z)
A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation [3.5684665108045377]
Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations. This work considers two such tasks: 1) estimating 2D rigid transformations between pairs of images and 2) predicting the fundamental matrix for stereo image pairs. Empirical comparative analysis shows that, similar to training from scratch, ViTs outperform CNNs during refinement in large downstream-data scenarios.
arXiv Detail & Related papers (2025-10-06T13:18:27Z)
LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence [61.46575527504109]
LimiX-16M and LimiX-2M treat structured data as a joint distribution over variables and missingness. We evaluate LimiX models across 11 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios.
arXiv Detail & Related papers (2025-09-03T17:39:08Z)
Comparative Analysis of Lightweight Deep Learning Models for Memory-Constrained Devices [0.0]
Five state-of-the-art architectures are benchmarked across three diverse datasets: CIFAR-10, CIFAR-100, and Tiny ImageNet. The models are assessed using four key performance metrics: classification accuracy, inference time, floating-point operations (FLOPs), and model size.
arXiv Detail & Related papers (2025-05-06T08:36:01Z)
Self-supervised Benchmark Lottery on ImageNet: Do Marginal Improvements Translate to Improvements on Similar Datasets? [1.3821203559674384]
We investigate whether models that seem to perform well on ImageNet may experience significant performance declines on similar datasets. Specifically, state-of-the-art frameworks such as DINO and SwAV, which are praised for their performance, exhibit substantial drops in performance. We argue that otherwise good and desirable properties of models remain hidden when benchmarking is only performed on the ImageNet validation set.
arXiv Detail & Related papers (2025-01-26T07:19:12Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
Exploring the design space of deep-learning-based weather forecasting systems [56.129148006412855]
This paper systematically analyzes the impact of different design choices on deep-learning-based weather forecasting systems. We study fixed-grid architectures such as UNet, fully convolutional architectures, and transformer-based models. We propose a hybrid system that combines the strong performance of fixed-grid models with the flexibility of grid-invariant architectures.
arXiv Detail & Related papers (2024-10-09T22:25:50Z)
Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration [100.54419875604721]
All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation. We propose DyNet, a dynamic family of networks designed in an encoder-decoder style for all-in-one image restoration tasks. Our DyNet can seamlessly switch between its bulkier and lightweight variants, thereby offering flexibility for efficient model deployment.
arXiv Detail & Related papers (2024-04-02T17:58:49Z)
ComFe: An Interpretable Head for Vision Transformers [8.572967695281054]
Interpretable computer vision models explain their classifications through comparing distances between the local annotations of an image and a set of prototypes that represent the training data. ComFe is the first interpretable approach we know of that, unlike other interpretable approaches, can be readily applied to ImageNet-1K.
arXiv Detail & Related papers (2024-03-07T00:44:21Z)
Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks. Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention. Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets. This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets. We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
Tidying Deep Saliency Prediction Architectures [6.613005108411055]
In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions. We propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state-of-the-art performance on public saliency benchmarks.
arXiv Detail & Related papers (2020-03-10T19:34:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.