Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification
- URL: http://arxiv.org/abs/2507.23315v2
- Date: Thu, 16 Oct 2025 13:29:58 GMT
- Title: Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification
- Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das,
- Abstract summary: This study evaluates the accuracy and deployment feasibility of seven modern lightweight architectures: ConvNeXt-T, EfficientV2-S, MobileNetV3-L, MobileViT v2 (S/XS), RepVGG-A2, and TinyViT-21M.<n> tuning alone leads to a top-1 accuracy improvement of 1.5 to 3.5 percent over baselines, and select models deliver latency under 5 milliseconds and over 9,800 frames per second.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Lightweight convolutional and transformer-based networks are increasingly preferred for real-time image classification, especially on resource-constrained devices. This study evaluates the impact of hyperparameter optimization on the accuracy and deployment feasibility of seven modern lightweight architectures: ConvNeXt-T, EfficientNetV2-S, MobileNetV3-L, MobileViT v2 (S/XS), RepVGG-A2, and TinyViT-21M, trained on a class-balanced subset of 90,000 images from ImageNet-1K. Under standardized training settings, this paper investigates the influence of learning rate schedules, augmentation, optimizers, and initialization on model performance. Inference benchmarks are performed using an NVIDIA L40s GPU with batch sizes ranging from 1 to 512, capturing latency and throughput in real-time conditions. This work demonstrates that controlled hyperparameter variation significantly alters convergence dynamics in lightweight CNN and transformer backbones, providing insight into stability regions and deployment feasibility in edge artificial intelligence. Our results reveal that tuning alone leads to a top-1 accuracy improvement of 1.5 to 3.5 percent over baselines, and select models (e.g., RepVGG-A2, MobileNetV3-L) deliver latency under 5 milliseconds and over 9,800 frames per second, making them ideal for edge deployment. This work provides reproducible, subset-based insights into lightweight hyperparameter tuning and its role in balancing speed and accuracy. The code and logs may be seen at: https://vineetkumarrakesh.github.io/lcnn-opt
Related papers
- AHDMIL: Asymmetric Hierarchical Distillation Multi-Instance Learning for Fast and Accurate Whole-Slide Image Classification [51.525891360380285]
AHDMIL is an Asymmetric Hierarchical Distillation Multi-Instance Learning framework.<n>It eliminates irrelevant patches through a two-step training process.<n>It consistently outperforms previous state-of-the-art methods in both classification performance and inference speed.
arXiv Detail & Related papers (2025-08-07T07:47:16Z) - Enhancing Training Data Attribution with Representational Optimization [57.61977909113113]
Training data attribution methods aim to measure how training data impacts a model's predictions.<n>We propose AirRep, a representation-based approach that closes this gap by learning task-specific and model-aligned representations explicitly for TDA.<n>AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence.
arXiv Detail & Related papers (2025-05-24T05:17:53Z) - Comparative Analysis of Lightweight Deep Learning Models for Memory-Constrained Devices [0.0]
Five state-of-the-art architectures are benchmarked across three diverse datasets: CIFAR-10, CIFAR-100, and Tiny ImageNet.<n>The models are assessed using four key performance metrics: classification accuracy, inference time, floating-point operations (FLOPs), and model size.
arXiv Detail & Related papers (2025-05-06T08:36:01Z) - Ultra-Resolution Adaptation with Ease [62.56434979517156]
We propose a set of key guidelines for ultra-resolution adaptation termed emphURAE.<n>We show that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable.<n>Experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations.
arXiv Detail & Related papers (2025-03-20T16:44:43Z) - Building Efficient Lightweight CNN Models [0.0]
Convolutional Neural Networks (CNNs) are pivotal in image classification tasks due to their robust feature extraction capabilities.<n>This paper introduces a methodology to construct lightweight CNNs while maintaining competitive accuracy.<n>The proposed model achieved a state-of-the-art accuracy of 99% on the handwritten digit MNIST and 89% on fashion MNIST, with only 14,862 parameters and a model size of 0.17 MB.
arXiv Detail & Related papers (2025-01-26T14:39:01Z) - A hybrid framework for effective and efficient machine unlearning [12.499101994047862]
Machine unlearning (MU) is proposed to remove the imprints of revoked samples from the already trained model parameters.<n>We present a novel hybrid strategy on top of them to achieve an overall success.
arXiv Detail & Related papers (2024-12-19T03:59:26Z) - RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization [8.346566205092433]
lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are favored for their parameter efficiency and low latency.
This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications.
arXiv Detail & Related papers (2024-06-23T04:11:12Z) - REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning [23.92661395403251]
Recent rehearsal-free methods, guided by prompts, excel in vision-related continual learning (CL) with drifting data but lack resource efficiency.<n>We introduce Resource-Efficient Prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free methods.<n>Our approach employs swift prompt selection to refine input data using a carefully provisioned model.
arXiv Detail & Related papers (2024-06-07T09:17:33Z) - ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE)
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38 $times$ higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z) - SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System [8.017085402991189]
We propose a novel filter-based VINS framework named SchurVINS.
It could guarantee both high accuracy by building a complete residual model and low computational complexity.
Experiments on EuRoC and TUM-VI datasets show that our method notably outperforms state-of-the-art (SOTA) methods in both accuracy and computational complexity.
arXiv Detail & Related papers (2023-12-04T04:14:09Z) - FastViT: A Fast Hybrid Vision Transformer using Structural
Reparameterization [14.707312504365376]
We introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.
We show that our model is 3.5x faster than CMT, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2023-03-24T17:58:32Z) - Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrate that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for
Mobile Vision Applications [68.35683849098105]
We introduce split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z) - A Fast and Efficient Conditional Learning for Tunable Trade-Off between
Accuracy and Robustness [11.35810118757863]
Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers.
We present a fast learnable once-for-all adversarial training (FLOAT) algorithm, which instead of the existing FiLM-based conditioning, presents a unique weight conditioned learning that requires no additional layer.
In particular, we add scaled noise to the weight tensors that enables a trade-off between clean and adversarial performance.
arXiv Detail & Related papers (2022-03-28T19:25:36Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the
Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and
Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slice a part of network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z) - Early Convolutions Help Transformers See Better [63.21712652156238]
Vision transformer (ViT) models exhibit substandard optimizability.
Modern convolutional neural networks are far easier to optimize.
Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance.
arXiv Detail & Related papers (2021-06-28T17:59:33Z) - EfficientNetV2: Smaller Models and Faster Training [91.77432224225221]
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models.
We use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency.
Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller.
arXiv Detail & Related papers (2021-04-01T07:08:36Z) - Towards Fast, Accurate and Stable 3D Dense Face Alignment [73.01620081047336]
We propose a novel regression framework named 3DDFA-V2 which makes a balance among speed, accuracy and stability.
We present a virtual synthesis method to transform one still image to a short-video which incorporates in-plane and out-of-plane face moving.
arXiv Detail & Related papers (2020-09-21T15:37:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.