Hammer: Robust Function-Calling for On-Device Language Models via Function Masking
- URL: http://arxiv.org/abs/2410.04587v2
- Date: Thu, 10 Oct 2024 17:29:52 GMT
- Title: Hammer: Robust Function-Calling for On-Device Language Models via Function Masking
- Authors: Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Jun Wang, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, Jun Wang, Weinan Zhang
- Abstract summary: Hammer is a novel family of foundation models specifically engineered for on-device function calling.
Our empirical evaluations reveal that Hammer not only outperforms larger models but also demonstrates robust generalization across diverse benchmarks.
- Score: 26.495781685810044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have demonstrated impressive value in performing as autonomous agents when equipped with external tools and API calls. Nonetheless, effectively harnessing their potential for executing complex tasks crucially relies on enhancements in their function calling capabilities. This paper identifies a critical gap in existing function calling models, where performance varies significantly across benchmarks, often because the models are misled by specific naming conventions. To address this issue, we introduce Hammer, a novel family of foundation models specifically engineered for on-device function calling. Hammer employs an augmented dataset that enhances models' sensitivity to irrelevant functions and incorporates function masking techniques to reduce the influence of misleading names. Our empirical evaluations reveal that Hammer not only outperforms larger models but also demonstrates robust generalization across diverse benchmarks, achieving state-of-the-art results. Our open source contributions include a specialized dataset for irrelevance detection, a tuning framework for enhanced generalization, and the Hammer models, establishing a new standard for function calling performance.
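To make the function masking idea concrete, here is a minimal, hypothetical sketch of one way such masking could be applied to a tool schema: function names are replaced with uninformative random placeholders so the model must rely on the descriptions rather than suggestive names, and a mapping restores the real names from the model's output. The helper names and schema layout are illustrative assumptions, not the paper's actual training pipeline.

```python
import copy
import random
import string

def mask_function_names(tools):
    """Replace function names in a tool schema with random placeholders.

    Returns the masked schema plus a mapping from placeholder back to the
    original name. `tools` is assumed to be a list of dicts such as
    {"name": "get_weather", "description": "...", "parameters": {...}}.
    """
    name_map = {}
    masked_tools = []
    for tool in tools:
        masked = copy.deepcopy(tool)
        placeholder = "func_" + "".join(random.choices(string.ascii_lowercase, k=8))
        name_map[placeholder] = tool["name"]  # remember how to undo the mask
        masked["name"] = placeholder
        masked_tools.append(masked)
    return masked_tools, name_map

def unmask_call(call, name_map):
    """Map a predicted call on a masked name back to the real function name."""
    return {**call, "name": name_map.get(call["name"], call["name"])}
```

In this spirit, a prompt built from `masked_tools` carries no naming cues, and any predicted call is translated back with `unmask_call` before execution; the paper's actual masking strategy and training setup may differ in its details.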
Related papers
- Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling [6.102559098873098]
Function calling is a complex task with widespread applications in domains such as information retrieval, software engineering and automation.
Large Language Models (LLMs) can automate this process but are computationally expensive and impractical in resource-constrained settings.
Small Language Models (SLMs) can operate efficiently, offering faster response times and lower computational demands.
arXiv Detail & Related papers (2025-04-27T15:26:51Z) - FeRG-LLM: Feature Engineering by Reason Generation Large Language Models [2.6740666148510077]
FeRG-LLM is a large language model designed to automatically perform feature engineering.
We have constructed two-stage conversational dialogues that enable language models to analyze machine learning tasks.
Experiments show that FeRG-LLM performs comparably to or better than Llama 3.1 70B on most datasets.
arXiv Detail & Related papers (2025-03-30T09:07:21Z) - Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation [85.68881632498909]
We propose a principled framework for synthesizing high-quality training trajectories for large language model agents.
The framework is based on automatic and iterative translations from a function signature path to a sequence of queries and executable function calls.
Experiments show that by applying supervised fine-tuning on the positive trajectories and preference optimization against negative trajectories, our 14B model, Magnet-14B-mDPO, obtains 68.01 on BFCL-v3 and 73.30 on ToolQuery.
arXiv Detail & Related papers (2025-03-10T20:13:07Z) - Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use.
MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools.
Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
arXiv Detail & Related papers (2025-02-18T15:45:01Z) - Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning.
We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads.
We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z) - FuXi-$\alpha$: Scaling Recommendation Model with Feature Interaction Enhanced Transformer [81.12174905444229]
Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy.
We propose a new model called FuXi-$\alpha$ to address these issues.
Our model outperforms existing models, with its performance continuously improving as the model size increases.
arXiv Detail & Related papers (2025-02-05T09:46:54Z) - HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios [31.43638572775755]
HammerBench is a novel framework for assessing mobile assistant function-calling capabilities in real-world, multi-turn dialogues.
Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios.
arXiv Detail & Related papers (2024-12-21T07:33:55Z) - Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks [0.8425561594225592]
This study introduces a novel framework for training smaller language models in function calling.
It focuses on specific logical and mathematical reasoning tasks.
The approach aims to improve the performance of small-scale models on these tasks through function calling.
arXiv Detail & Related papers (2024-10-24T16:27:35Z) - CITI: Enhancing Tool Utilizing Ability in Large Language Models without Sacrificing General Performance [17.723293304671877]
We propose a Component-based Tool-utilizing ability Injection method (CITI).
Based on the gradient-based importance scores of different components, CITI alleviates the capability conflicts caused by the fine-tuning process.
Experimental results demonstrate that our approach achieves outstanding performance across a range of evaluation metrics.
arXiv Detail & Related papers (2024-09-20T04:06:28Z) - How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics [17.086867242274813]
We analyse how performance develops as a function of model characteristics such as the number of parameters or type of training.
We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket.
We also find a certain degree of unpredictability in performance across access methods, possibly due to unexposed sampling parameters.
arXiv Detail & Related papers (2024-06-20T07:17:09Z) - FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation [73.454943870226]
Language models have shown impressive in-context-learning capabilities.
We propose FamiCom, a more comprehensive measure for task-agnostic performance estimation.
arXiv Detail & Related papers (2024-06-17T06:14:55Z) - Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking [53.66999416757543]
We study how fine-tuning affects the internal mechanisms implemented in language models.
Fine-tuning enhances, rather than alters, the mechanistic operation of the model.
arXiv Detail & Related papers (2024-02-22T18:59:24Z) - Anchor function: a type of benchmark functions for studying language models [18.005251277048178]
We propose the concept of an anchor function to study language models in learning tasks that follow an "anchor-key" pattern.
The anchor function plays a role analogous to that of mice in diabetes research, making it particularly suitable for academic research.
arXiv Detail & Related papers (2024-01-16T12:10:49Z) - Specialist or Generalist? Instruction Tuning for Specific NLP Tasks [58.422495509760154]
We investigate whether incorporating broad-coverage generalist instruction tuning can contribute to building a specialist model.
Our experiments assess four target tasks with distinct coverage levels.
The effect is particularly pronounced when the amount of task-specific training data is limited.
arXiv Detail & Related papers (2023-10-23T19:46:48Z) - On the Efficacy of Generalization Error Prediction Scoring Functions [33.24980750651318]
Generalization error predictors (GEPs) aim to predict model performance on unseen distributions by deriving dataset-level error estimates from sample-level scores.
We rigorously study the effectiveness of popular scoring functions (confidence, local manifold smoothness, model agreement) independent of mechanism choice.
arXiv Detail & Related papers (2023-03-23T18:08:44Z) - Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z) - Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models [648.3665819567409]
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale.
BIG-bench consists of 204 tasks, contributed by 450 authors across 132 institutions.
We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench.
arXiv Detail & Related papers (2022-06-09T17:05:34Z) - ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.