Accelerating Local AI on Consumer GPUs: A Hardware-Aware Dynamic Strategy for YOLOv10s
- URL: http://arxiv.org/abs/2509.07928v1
- Date: Tue, 09 Sep 2025 17:13:31 GMT
- Title: Accelerating Local AI on Consumer GPUs: A Hardware-Aware Dynamic Strategy for YOLOv10s
- Authors: Mahmudul Islam Masum, Miad Islam, Arif I. Sarwat,
- Abstract summary: We introduce a Two-Pass Adaptive Inference algorithm, a model-independent approach that requires no architectural changes.<n>On a 5000-image COCO dataset, our method achieves a 1.85x speedup over a PyTorch Early-Exit baseline, with a modest mAP loss of 5.51%.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As local AI grows in popularity, there is a critical gap between the benchmark performance of object detectors and their practical viability on consumer-grade hardware. While models like YOLOv10s promise real-time speeds, these metrics are typically achieved on high-power, desktop-class GPUs. This paper reveals that on resource-constrained systems, such as laptops with RTX 4060 GPUs, performance is not compute-bound but is instead dominated by system-level bottlenecks, as illustrated by a simple bottleneck test. To overcome this hardware-level constraint, we introduce a Two-Pass Adaptive Inference algorithm, a model-independent approach that requires no architectural changes. This study mainly focuses on adaptive inference strategies and undertakes a comparative analysis of architectural early-exit and resolution-adaptive routing, highlighting their respective trade-offs within a unified evaluation framework. The system uses a fast, low-resolution pass and only escalates to a high-resolution model pass when detection confidence is low. On a 5000-image COCO dataset, our method achieves a 1.85x speedup over a PyTorch Early-Exit baseline, with a modest mAP loss of 5.51%. This work provides a practical and reproducible blueprint for deploying high-performance, real-time AI on consumer-grade devices by shifting the focus from pure model optimization to hardware-aware inference strategies that maximize throughput.
Related papers
- Agentic Test-Time Scaling for WebAgents [65.5178428849495]
We present Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious.<n>CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling.
arXiv Detail & Related papers (2026-02-12T18:58:30Z) - AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines [3.4381029715186844]
AIE4ML is a framework for converting AI models automatically into optimized firmware targeting the AIE-ML generation devices.<n>We achieve up to 98.6% efficiency relative to the single- kernel baseline.<n>With evaluations across real-world model topologies, we demonstrate that AIE4ML delivers GPU-class throughput under microsecond latency constraints.
arXiv Detail & Related papers (2025-12-17T20:13:05Z) - MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception.<n>High computational cost limits adoption on resource-constrained platforms.<n>We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z) - A Comprehensive Evaluation of YOLO-based Deer Detection Performance on Edge Devices [6.486957474966142]
The escalating economic losses in agriculture due to deer intrusion, estimated to be in the hundreds of millions of dollars annually in the U.S., highlight the inadequacy of traditional mitigation strategies.<n>This underscores a critical need for intelligent, autonomous solutions capable of real-time deer detection and deterrence.<n>This study presents a comprehensive evaluation of state-of-the-art deep learning models for deer detection in challenging real-world scenarios.
arXiv Detail & Related papers (2025-09-24T17:01:50Z) - RoHOI: Robustness Benchmark for Human-Object Interaction Detection [78.18946529195254]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support.<n>We introduce the first benchmark for HOI detection, evaluating model resilience under diverse challenges.<n>Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z) - Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices [11.05223262950967]
Speech recognition software needs to be able to adjust the computational load of neural models during inference in a resource aware manner.<n>Early-exit architectures process the input with a subset of their layers, exiting at intermediate branches.<n>For automatic speech recognition applications there are memory-efficient neural architectures that apply variable frame rate analysis.<n>We show that in this way the speech recognition performance on standard benchmarks significantly improve, at the cost of a small increase in the overall number of model parameters.
arXiv Detail & Related papers (2025-06-22T13:34:18Z) - Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers [22.349130691342687]
Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment.<n>We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation.
arXiv Detail & Related papers (2025-06-05T14:41:38Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - Layer Ensemble Averaging for Improving Memristor-Based Artificial Neural Network Performance [0.6560901506023631]
In-memory computation architectures, like memristors, offer promise but face challenges due to hardware non-idealities.
Layer ensemble averaging is a technique to map pre-trained neural network solutions from software to defective hardware crossbars.
Results show that layer ensemble averaging can reliably boost defective memristive network performance up to the software baseline.
arXiv Detail & Related papers (2024-04-24T03:19:31Z) - Sample Less, Learn More: Efficient Action Recognition via Frame Feature
Restoration [59.6021678234829]
We propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames.
With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy.
arXiv Detail & Related papers (2023-07-27T13:52:42Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - Reinforcement Learning with Latent Flow [78.74671595139613]
Flow of Latents for Reinforcement Learning (Flare) is a network architecture for RL that explicitly encodes temporal information through latent vector differences.
We show that Flare recovers optimal performance in state-based RL without explicit access to the state velocity.
We also show that Flare achieves state-of-the-art performance on pixel-based challenging continuous control tasks within the DeepMind control benchmark suite.
arXiv Detail & Related papers (2021-01-06T03:50:50Z) - AIPerf: Automated machine learning as an AI-HPC benchmark [17.57686674304368]
We propose an end-to-end benchmark suite utilizing automated machine learning (AutoML)
We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems.
With flexible workload and single metric, our benchmark can scale and rank AI- HPC easily.
arXiv Detail & Related papers (2020-08-17T08:06:43Z) - Latency-Aware Differentiable Neural Architecture Search [113.35689580508343]
Differentiable neural architecture search methods became popular in recent years, mainly due to their low search costs and flexibility in designing the search space.
However, these methods suffer the difficulty in optimizing network, so that the searched network is often unfriendly to hardware.
This paper deals with this problem by adding a differentiable latency loss term into optimization, so that the search process can tradeoff between accuracy and latency with a balancing coefficient.
arXiv Detail & Related papers (2020-01-17T15:55:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.