Related papers: Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever

URL: http://arxiv.org/abs/2508.14323v2
Date: Mon, 25 Aug 2025 04:46:11 GMT
Title: Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever
Authors: Yixin Chen, Ying Xiong, Shangyu Wu, Yufei Cui, Xue Liu, Nan Guan, Chun Jason Xue
Abstract summary: Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities. Inaccurate function calls can lead to inefficiencies and increased costs. Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting. We trained a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations.
Score: 28.307080649683403
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities, but inaccurate function calls can lead to inefficiencies and increased costs. Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting, yet they often suffer from high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. In this paper, we trained a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations to help LLMs make more accurate tool-using decisions. To train the BAR, we construct a corpus including different function-calling behaviors, i.e., calling or non-calling. We use the contrastive learning framework to train the BAR with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of behaviorally consistent examples. Experiments demonstrate that our approach significantly reduces erroneous function calls while maintaining high task performance, offering a cost-effective and efficient solution for tool-augmented LLMs.
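
Illustrative sketch (not from the paper): the abstract does not give the exact form of the dual-negative contrastive loss, so the following is only one plausible reading, an InfoNCE-style objective whose denominator sums over two separate pools of negatives. The function names, the two pool names, and the temperature value are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def dual_negative_loss(query, positive, opposite_behavior_negs, other_negs,
                       temperature=0.1):
    """InfoNCE-style loss with two negative pools (hypothetical formulation).

    opposite_behavior_negs: embeddings of examples whose calling behavior
        differs from the query's (assumed meaning of one negative type).
    other_negs: any further negatives (assumed second negative type).
    """
    pos = math.exp(cosine(query, positive) / temperature)
    neg = sum(
        math.exp(cosine(query, n) / temperature)
        for n in list(opposite_behavior_negs) + list(other_negs)
    )
    return -math.log(pos / (pos + neg))


# Toy 2-D embeddings: the loss is small when the positive is close to the
# query and large when a negative is closer than the positive.
aligned = dual_negative_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], [[-1.0, 0.0]])
misaligned = dual_negative_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]], [[-1.0, 0.0]])
print(aligned < misaligned)
```

Under this reading, minimizing the loss pulls a query toward demonstrations with the same calling behavior and pushes it away from both negative pools, which is the retrieval property the abstract describes.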

Related papers

On the Paradoxical Interference between Instruction-Following and Task Solving [50.75960598434753]
Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. We reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs' task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving.
arXiv Detail & Related papers (2026-01-29T17:48:56Z)
Exploring Weaknesses in Function Call Models via Reinforcement Learning: An Adversarial Data Augmentation Approach [1.4795423578096045]
We propose a novel adversarial data augmentation method to improve function call capabilities of Large Language Models (LLMs). Our training framework introduces a query model trained with reinforcement learning to generate adversarial queries that are specifically designed to challenge function call (FC) models. Overall, our method advances the development of more robust FC models and provides a systematic way to identify and correct weaknesses in the ability of LLMs to interact with external tools.
arXiv Detail & Related papers (2026-01-27T02:49:07Z)
ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration [68.89572566071575]
ET-Agent is a training framework for calibrating an agent's tool-use behavior. It is designed to progressively calibrate erroneous behavioral patterns to optimal behaviors.
arXiv Detail & Related papers (2026-01-11T11:05:26Z)
BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning [82.925106913459]
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning. We introduce BOTS, a unified framework for Bayesian Online Task Selection in RFT.
arXiv Detail & Related papers (2025-10-30T11:15:23Z)
Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions [10.598440138966028]
Current self-reflection practices rely on prompts or one-way reasoning. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls.
arXiv Detail & Related papers (2025-09-23T09:35:49Z)
Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates [56.73907811047611]
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities. LLMs often fail in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. We introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function calls.
arXiv Detail & Related papers (2025-09-22T17:55:14Z)
Leveraging In-Context Learning for Language Model Agents [51.2996117207114]
In-context learning (ICL) with dynamically selected demonstrations combines the flexibility of prompting large language models (LLMs) with the ability to leverage training data to improve performance. We show that set-selection of trajectories of similar tasks as demonstrations significantly improves performance, reliability, robustness, and efficiency of LLM agents. We find that demonstrations obtained from larger models (in the annotation phase) also improve smaller models, and that ICL agents can even rival costlier trained agents.
arXiv Detail & Related papers (2025-06-16T05:37:49Z)
FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement [23.301601376960104]
We introduce FunReason, a framework that enhances large language models' function calling capabilities. FunReason generates high-quality training examples, focusing on parseability, reasoning coherence, and function call precision. FunReason achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning.
arXiv Detail & Related papers (2025-05-26T16:38:06Z)
Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs [15.806503459642665]
We propose a new algorithm for fine-tuning large language models using reinforcement learning. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples.
arXiv Detail & Related papers (2025-03-18T14:23:37Z)
Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation [85.68881632498909]
We propose a principled framework for synthesizing high-quality training trajectories for large language model agents. The framework is based on automatic and iterative translations from a function signature path to a sequence of queries and executable function calls. Experiments show that, by training on the positive trajectories with supervised fine-tuning and applying preference optimization against negative trajectories, our 14B model, Magnet-14B-mDPO, obtains 68.01 on BFCL-v3 and 73.30 on ToolQuery.
arXiv Detail & Related papers (2025-03-10T20:13:07Z)
Alopex: A Computational Framework for Enabling On-Device Function Calls with LLMs [31.961168273386757]
Alopex is a framework that enables precise on-device function calls using the Fox Large Language Models. A data mixing strategy is used to mitigate catastrophic forgetting, combining function call data with textbook datasets to enhance performance in various tasks.
arXiv Detail & Related papers (2024-11-07T22:15:17Z)
Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. We evaluate the performance of LLMs' tool use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
We show that editing a small subset of parameters can effectively modulate specific behaviors of large language models (LLMs). Our approach achieves reductions of up to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen.
arXiv Detail & Related papers (2024-07-11T17:52:03Z)
Inverse-RLignment: Large Language Model Alignment from Demonstrations through Inverse Reinforcement Learning [62.05713042908654]
We introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges. We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals. Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD.
arXiv Detail & Related papers (2024-05-24T15:13:53Z)
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics. It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
Dr.ICL: Demonstration-Retrieved In-context Learning [29.142262267850704]
In-context learning (ICL), teaching a large language model to perform a task with few-shot demonstrations, has emerged as a strong paradigm for using LLMs. Recent research suggests that retrieving semantically similar demonstrations to the input from a pool of available demonstrations results in better performance. This work expands the applicability of retrieval-based ICL approaches by demonstrating that even simple word-overlap similarity measures such as BM25 outperform randomly selected demonstrations.
arXiv Detail & Related papers (2023-05-23T14:55:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.