TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices
- URL: http://arxiv.org/abs/2511.22138v1
- Date: Thu, 27 Nov 2025 06:09:54 GMT
- Title: TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices
- Authors: Mohd Ariful Haque, Fahad Rahman, Kishor Datta Gupta, Khalil Shujaee, Roy George,
- Abstract summary: This paper investigates the effectiveness of small language models (SLMs) for agentic tasks (function/tool/API calling)<n>We describe parameter-driven optimization strategies that include supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning (RL), and hybrid methods.<n>Our results demonstrate clear accuracy differences across model scales where medium-sized models (1-3B parameters) significantly outperform ultra-compact models (1B parameters)<n>This study highlights the importance of hybrid optimization strategies that enable small language models to deliver accurate, efficient, and stable agentic AI on edge devices.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the effectiveness of small language models (SLMs) for agentic tasks (function/tool/API calling) with a focus on running agents on edge devices without reliance on cloud infrastructure. We evaluate SLMs using the Berkeley Function Calling Leaderboard (BFCL) framework and describe parameter-driven optimization strategies that include supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning (RL)-based optimization, preference alignment via Direct Preference Optimization (DPO), and hybrid methods. We report results for models including TinyAgent, TinyLlama, Qwen, and xLAM across BFCL categories (simple, multiple, parallel, parallel-multiple, and relevance detection), both in live and non-live settings, and in multi-turn evaluations. We additionally detail a DPO training pipeline constructed from AgentBank data (e.g., ALFRED), including our conversion of SFT data to chosen-rejected pairs using TinyLlama responses as rejected outputs and manual validation. Our results demonstrate clear accuracy differences across model scales where medium-sized models (1-3B parameters) significantly outperform ultra-compact models (<1B parameters), achieving up to 65.74% overall accuracy, and 55.62% multi-turn accuracy with hybrid optimization. This study highlights the importance of hybrid optimization strategies that enable small language models to deliver accurate, efficient, and stable agentic AI on edge devices, making privacy-preserving, low-latency autonomous agents practical beyond the cloud.
Related papers
- Relation-Aware Bayesian Optimization of DBMS Configurations Guided by Affinity Scores [2.474203056060563]
Database Management Systems (DBMSs) are fundamental for managing large-scale and heterogeneous data, and their performance is critically influenced by configuration parameters.<n>Recent research has focused on automated configuration optimization using machine learning; however, existing approaches still exhibit several key limitations.<n>We propose RelTune, a novel framework that represents parameter dependencies as a Graph and learns GNN-based latent embeddings that encode performancerelevant semantics.
arXiv Detail & Related papers (2025-10-31T03:46:42Z) - Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents.<n> ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts.<n>We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z) - Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization [13.271737599933147]
We introduce EntroPO, an entropy-enhanced framework that adapts existing preference optimization algorithms to the multi-turn, tool-assisted setting.<n>We validate EntroPO by fine-tuning a diverse suite of models from different families and sizes.<n>On the swebench leaderboard, our approach establishes new state-of-the-art results among open-weight models.
arXiv Detail & Related papers (2025-09-15T20:36:19Z) - PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning [54.99373314906667]
Self-supervised representation learning for point cloud has demonstrated effectiveness in improving pre-trained model performance across diverse tasks.<n>As pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources.<n>We propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models.
arXiv Detail & Related papers (2025-04-22T16:41:21Z) - Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning? [45.58422897857411]
This work explores the use of large language models (LLMs) for hyperparameter optimization by fine-tuning a parameter-efficient version of Code Llama using LoRA.<n>Our approach achieves competitive or superior Root Mean Square Error (RMSE) while substantially reducing computational overhead.<n>Results demonstrate that LLM-based optimization not only rivals established Bayesian methods like Tree-structured Parzen Estimators (TPE) but also accelerates tuning for real-world applications requiring perceptual quality and low-latency processing.
arXiv Detail & Related papers (2025-04-08T13:15:47Z) - Pruning-Based TinyML Optimization of Machine Learning Models for Anomaly Detection in Electric Vehicle Charging Infrastructure [8.29566258132752]
This paper investigates a pruning method for anomaly detection in resource-constrained environments, specifically targeting EVCI.<n> optimized models achieved significant reductions in model size and inference times, with only a marginal impact on their performance.<n> Notably, our findings indicate that, in the context of EVCI, pruning and FS can enhance computational efficiency while retaining critical anomaly detection capabilities.
arXiv Detail & Related papers (2025-03-19T00:18:37Z) - Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining [59.369484219304866]
We conduct an unprecedented empirical investigation training over 3,700 Large Language Models (LLMs) from scratch across 100 trillion tokens.<n>We establish a universal Scaling Law for hyperparameter optimization in LLM Pre-training, called Step Law.<n>Our estimated optima deviates from the global best performance found via exhaustive search by merely 0.094% on the test set.
arXiv Detail & Related papers (2025-03-06T18:58:29Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Low-Rank Representations Meets Deep Unfolding: A Generalized and
Interpretable Network for Hyperspectral Anomaly Detection [41.50904949744355]
Current hyperspectral anomaly detection (HAD) benchmark datasets suffer from low resolution, simple background, and small size of the detection data.
These factors also limit the performance of the well-known low-rank representation (LRR) models in terms of robustness.
We build a new set of HAD benchmark datasets for improving the robustness of the HAD algorithm in complex scenarios, AIR-HAD for short.
arXiv Detail & Related papers (2024-02-23T14:15:58Z) - Federated Learning of Large Language Models with Parameter-Efficient
Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally incurs the update of significant parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.