Octopus v2: On-device language model for super agent
- URL: http://arxiv.org/abs/2404.01744v5
- Date: Tue, 16 Apr 2024 16:39:51 GMT
- Title: Octopus v2: On-device language model for super agent
- Authors: Wei Chen, Zhiyuan Li
- Abstract summary: Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency.
When compared to Llama-7B with a RAG-based function calling mechanism, our method improves latency by 35-fold.
- Score: 10.998608318944985
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language models have shown effectiveness in a variety of software applications, particularly in tasks related to automatic workflow. These models possess the crucial ability to call functions, which is essential in creating AI agents. Despite the high performance of large-scale language models in cloud environments, they are often associated with concerns over privacy and cost. Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95%. When compared to Llama-7B with a RAG-based function calling mechanism, our method improves latency by 35-fold. This method reduces the latency to levels deemed suitable for deployment across a variety of edge devices in production environments, aligning with the performance requisites for real-world applications.
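To make the context-length comparison concrete, here is a minimal sketch (not the paper's implementation) contrasting a RAG-style prompt that pastes retrieved function descriptions into the context with a compact prompt for a model that has internalized its functions during fine-tuning; the function registry and whitespace token count below are hypothetical stand-ins.

```python
# Illustrative sketch only: contrasts prompt sizes for RAG-style function calling
# versus a compact on-device prompt. The registry, descriptions, and whitespace
# "tokenizer" are hypothetical stand-ins, not the Octopus v2 implementation.

FUNCTION_DOCS = {
    "send_email": "send_email(recipient: str, subject: str, body: str) -> bool. "
                  "Sends an email ... (long natural-language description)",
    "set_alarm": "set_alarm(time: str, label: str) -> None. Sets an alarm ...",
    "get_weather": "get_weather(city: str, unit: str = 'C') -> dict. Returns ...",
}

def rag_prompt(query: str, retrieved: list[str]) -> str:
    """Baseline: retrieved function descriptions are pasted into the context."""
    docs = "\n".join(FUNCTION_DOCS[name] for name in retrieved)
    return f"Available functions:\n{docs}\n\nUser: {query}\nCall:"

def compact_prompt(query: str) -> str:
    """On-device style: the model already knows its functions, so the prompt
    carries only the user query."""
    return f"User: {query}\nCall:"

def n_tokens(text: str) -> int:
    return len(text.split())  # crude whitespace proxy for a real tokenizer

query = "Email Alice that the meeting moved to 3pm."
long_p = rag_prompt(query, retrieved=list(FUNCTION_DOCS))
short_p = compact_prompt(query)
print(f"RAG-style prompt: {n_tokens(long_p)} tokens")
print(f"Compact prompt:   {n_tokens(short_p)} tokens")
```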
Related papers
- Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling [6.102559098873098]
Function calling is a complex task with widespread applications in domains such as information retrieval, software engineering and automation.
Large Language Models (LLMs) can automate this process but are computationally expensive and impractical in resource-constrained settings.
Small Language Models (SLMs) can operate efficiently, offering faster response times and lower computational demands.
arXiv Detail & Related papers (2025-04-27T15:26:51Z) - Token Level Routing Inference System for Edge Devices [21.721914273034972]
We present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation.
Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, while uploading under 7% of generated tokens to the large model in the cloud.
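A minimal sketch of the routing idea, assuming a confidence threshold on the on-device model's next-token score decides when a token is "critical" enough to consult the cloud model; both model functions and the criterion below are placeholders, not the system's actual policy.

```python
import random

# Placeholder stand-ins for the on-device and cloud models: each returns the
# next token given the context so far (the small model also returns a confidence).
def small_model_next(context: list[str]) -> tuple[str, float]:
    return "the", random.uniform(0.2, 1.0)   # hypothetical on-device model
def large_model_next(context: list[str]) -> str:
    return "platypus"                         # hypothetical cloud model

def routed_decode(prompt: list[str], max_new_tokens: int = 20,
                  confidence_threshold: float = 0.5) -> tuple[list[str], float]:
    """Generate on-device; escalate only low-confidence tokens to the cloud."""
    context = list(prompt)
    uploaded = 0
    for _ in range(max_new_tokens):
        token, conf = small_model_next(context)
        if conf < confidence_threshold:        # critical token -> ask the cloud
            token = large_model_next(context)
            uploaded += 1
        context.append(token)
    return context, uploaded / max_new_tokens

tokens, upload_frac = routed_decode(["Q:", "Why", "is", "the", "sky", "blue", "?"])
print(f"fraction of tokens generated in the cloud: {upload_frac:.0%}")
```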
arXiv Detail & Related papers (2025-04-10T15:54:19Z) - EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning.
Our approach only leverages a small number of samples to search for the desired pruning policy.
We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z) - PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing [48.30406812516552]
We introduce PLM, a Peripheral Language Model, developed through a co-design process that jointly optimizes model architecture and edge system constraints.
PLM employs a Multi-head Latent Attention mechanism and the squared ReLU activation function to encourage sparsity, thereby reducing peak memory footprint.
Evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data.
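The squared ReLU activation mentioned above is simple to state; the snippet below shows it in isolation and why it yields exact zeros (the surrounding PLM architecture, including Multi-head Latent Attention, is not reproduced here).

```python
import numpy as np

def squared_relu(x: np.ndarray) -> np.ndarray:
    """Squared ReLU: max(0, x)^2. Negative pre-activations map to exactly zero,
    which encourages sparse activations and a smaller peak memory footprint."""
    return np.maximum(x, 0.0) ** 2

x = np.random.randn(4, 8)          # toy pre-activations
y = squared_relu(x)
print(f"activation sparsity: {(y == 0).mean():.0%}")  # ~50% zeros for N(0,1) inputs
```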
arXiv Detail & Related papers (2025-03-15T15:11:17Z) - Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use.
MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools.
Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
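One reading of "cognitive signals in the representation space" is a lightweight probe over a hidden state that predicts whether a tool call is needed; the linear probe below is an illustrative assumption, not MeCo's published design.

```python
import numpy as np

# Hypothetical setup: h is the hidden state the LLM produces for a query, and a
# small linear probe (trained offline on labeled examples) scores whether an
# external tool should be invoked. This is a stand-in, not MeCo's exact design.
HIDDEN_DIM = 16
rng = np.random.default_rng(0)
probe_w = rng.normal(size=HIDDEN_DIM)   # probe weights (would be learned)
probe_b = 0.0

def tool_needed(hidden_state: np.ndarray, threshold: float = 0.5) -> bool:
    """Score the hidden state with the probe and decide on tool invocation."""
    logit = hidden_state @ probe_w + probe_b
    prob = 1.0 / (1.0 + np.exp(-logit))      # sigmoid
    return prob > threshold

h = rng.normal(size=HIDDEN_DIM)              # hidden state for some query
print("invoke external tool" if tool_needed(h) else "answer directly")
```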
arXiv Detail & Related papers (2025-02-18T15:45:01Z) - FuXi-α: Scaling Recommendation Model with Feature Interaction Enhanced Transformer [81.12174905444229]
Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy.
We propose a new model called FuXi-α to address these issues.
Our model outperforms existing models, with its performance continuously improving as the model size increases.
arXiv Detail & Related papers (2025-02-05T09:46:54Z) - Adaptive Rank Allocation for Federated Parameter-Efficient Fine-Tuning of Language Models [40.69348434971122]
We propose FedARA, a novel Adaptive Rank Allocation framework for federated parameter-efficient fine-tuning of language models.
FedARA consistently outperforms baselines by an average of 6.95% to 8.49% across various datasets and models under heterogeneous data.
Experiments on various edge devices demonstrate substantial decreases in total training time and energy consumption by up to 48.90% and 46.95%, respectively.
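As a hedged illustration of why rank allocation matters for on-device fine-tuning cost (FedARA's allocation rule itself is not reproduced here), LoRA-style adapter parameters grow linearly with the assigned rank, so giving lower ranks to less important modules or weaker clients directly cuts per-round training and communication cost.

```python
# Illustrative only: trainable parameters of a low-rank adapter for a
# d_out x d_in weight, delta_W = B @ A with B: (d_out, r) and A: (r, d_in).
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    return d_out * r + r * d_in

d_out, d_in = 768, 768          # hypothetical attention projection size
for r in (2, 4, 8, 16):
    print(f"rank {r:>2}: {lora_trainable_params(d_out, d_in, r):,} trainable params")
```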
arXiv Detail & Related papers (2025-01-24T11:19:07Z) - Less is More: Optimizing Function Calling for LLM Execution on Edge Devices [0.44784055850794474]
Large Language Models (LLMs) struggle with function calling at the edge because they cannot handle complex inputs or manage multiple tools effectively.
We introduce Less-is-More, a novel fine-tuning-free function-calling scheme for dynamic tool selection.
Our approach is based on the key insight that selectively reducing the number of tools available to LLMs significantly improves their function-calling performance, execution time, and power efficiency on edge devices.
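A hedged sketch of the tool-reduction idea: shrink the tool list before building the prompt so the edge model sees fewer, more relevant options. The keyword-overlap scoring and top-k cutoff below are illustrative assumptions, not the paper's selection scheme.

```python
# Hedged sketch: shrink the tool list before building the prompt. The overlap
# scoring and top-k cutoff are illustrative assumptions, not the paper's scheme.
TOOLS = {
    "get_weather":  "look up the current weather forecast for a city",
    "send_email":   "compose and send an email to a contact",
    "set_timer":    "start a countdown timer for a duration",
    "play_music":   "play a song, album, or playlist",
    "create_event": "add a calendar event with a time and title",
}

def select_tools(query: str, k: int = 2) -> list[str]:
    """Keep only the k tools whose descriptions overlap most with the query."""
    q_words = set(query.lower().split())
    scores = {name: len(q_words & set(desc.split())) for name, desc in TOOLS.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

query = "set a timer for ten minutes"
print(select_tools(query))   # a much smaller tool list is passed to the LLM
```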
arXiv Detail & Related papers (2024-11-23T00:51:09Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - Compressing Large Language Models with Automated Sub-Network Search [41.452512557226335]
We consider model compression for Large Language Models to reduce model size while improving downstream task performance.
We phrase this as a neural architecture search problem that automatically prunes structural components.
Our method achieves up to 9.85% improvement on average across 11 diverse downstream tasks, while achieving up to 22% improvement in on-device latency.
arXiv Detail & Related papers (2024-10-09T02:14:39Z) - xLAM: A Family of Large Action Models to Empower AI Agent Systems [111.5719694445345]
We release xLAM, a series of large action models designed for AI agent tasks.
xLAM consistently delivers exceptional performance across multiple agent ability benchmarks.
arXiv Detail & Related papers (2024-09-05T03:22:22Z) - ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data.
We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z) - The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines [6.381783966294295]
Open-source large language models (LLMs) enable developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance.
We analyze the performance, particularly the throughput (tokens generated per unit of time) of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines.
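A minimal throughput probe in the spirit of that measurement, using the HuggingFace pipeline API; the model name, prompt, and single-request setup are placeholders, and the vLLM side and request batching are omitted.

```python
import time
from transformers import pipeline

# Minimal throughput probe: tokens generated per second for one prompt.
# Model name and prompt are placeholders; a real benchmark would batch requests
# and average over many prompts and models, as the study does.
generator = pipeline("text-generation", model="gpt2")

prompt = "Edge deployment of language models requires"
start = time.perf_counter()
output = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
elapsed = time.perf_counter() - start

new_text = output[len(prompt):]                       # strip the echoed prompt
n_new_tokens = len(generator.tokenizer(new_text)["input_ids"])
print(f"throughput: {n_new_tokens / elapsed:.1f} tokens/s")
```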
arXiv Detail & Related papers (2024-08-02T06:56:59Z) - Knowledge boosting during low-latency inference [20.617827647115874]
Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints.
We propose knowledge boosting, a novel technique that allows a large model to operate on time-delayed input during inference, while still boosting small model performance.
Our results show larger gains where the performance gap between the small and large models is wide, demonstrating a promising method for large-small model collaboration for low-latency applications.
arXiv Detail & Related papers (2024-07-09T22:04:23Z) - On the Worst Prompt Performance of Large Language Models [93.13542053835542]
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z) - A Deep Recurrent-Reinforcement Learning Method for Intelligent AutoScaling of Serverless Functions [18.36339203254509]
Function-as-a-Service (FaaS) introduces a lightweight, function-based cloud execution model that is relevant to a range of applications like IoT-edge data processing and anomaly detection.
arXiv Detail & Related papers (2023-08-11T04:41:19Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - DUET: A Tuning-Free Device-Cloud Collaborative Parameters Generation Framework for Efficient Device Model Generalization [66.27399823422665]
Device Model Generalization (DMG) is a practical yet under-investigated research topic for on-device machine learning applications.
We propose an efficient Device-cloUd collaborative parametErs generaTion framework DUET.
arXiv Detail & Related papers (2022-09-12T13:26:26Z) - Efficient Person Search: An Anchor-Free Approach [86.45858994806471]
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images.
To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN.
In this work, we present an anchor-free approach to efficiently tackling this challenging task, by introducing the following dedicated designs.
arXiv Detail & Related papers (2021-09-01T07:01:33Z) - Communication-Computation Efficient Device-Edge Co-Inference via AutoML [4.06604174802643]
Device-edge co-inference partitions a deep neural network between a resource-constrained mobile device and an edge server.
On-device model sparsity level and intermediate feature compression ratio have direct impacts on workload and communication overhead.
We propose a novel automated machine learning (AutoML) framework based on deep reinforcement learning (DRL).
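A rough sketch of the partitioned-inference setting such a framework searches over: the device runs the first layers, compresses the intermediate feature, and the edge server finishes, so the partition point and compression ratio jointly set the compute/communication trade-off. Layer shapes and the 8-bit quantizer below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Illustrative device-edge split: the device computes the first `split` layers,
# quantizes the intermediate feature to 8 bits, and the edge server finishes.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 64)) * 0.1 for _ in range(6)]  # toy linear layers

def forward(x: np.ndarray, weight_list) -> np.ndarray:
    for w in weight_list:
        x = np.maximum(x @ w, 0.0)            # linear + ReLU
    return x

def quantize_u8(x: np.ndarray):
    lo, hi = x.min(), x.max()
    q = np.round((x - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    return q, lo, hi

split = 3                                      # partition point (searchable)
x = rng.normal(size=(1, 64))
feat = forward(x, layers[:split])              # on-device part
q, lo, hi = quantize_u8(feat)                  # compressed feature sent uplink
feat_hat = q.astype(np.float32) / 255 * (hi - lo) + lo
y = forward(feat_hat, layers[split:])          # edge-server part
print(f"uplink payload: {q.nbytes} bytes (vs {feat.astype(np.float32).nbytes} fp32)")
```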
arXiv Detail & Related papers (2021-08-30T06:36:30Z)