Related papers: ScaleCall -- Agentic Tool Calling at Scale for Fintech: Challenges, Methods, and Deployment Insights


URL: http://arxiv.org/abs/2511.00074v1
Date: Wed, 29 Oct 2025 14:35:46 GMT
Title: ScaleCall -- Agentic Tool Calling at Scale for Fintech: Challenges, Methods, and Deployment Insights
Authors: Richard Osuagwu, Thomas Cook, Maraim Masoud, Koustav Ghosal, Riccardo Mattivi,
Abstract summary: While Large Language Models (LLMs) excel at tool calling, deploying these capabilities in regulated enterprise environments such as fintech presents unique challenges. We present a comprehensive study of tool retrieval methods for enterprise environments through the development and deployment of ScaleCall, a prototype tool-calling framework within Mastercard.
Score: 0.18643247155980827
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: While Large Language Models (LLMs) excel at tool calling, deploying these capabilities in regulated enterprise environments such as fintech presents unique challenges due to on-premises constraints, regulatory compliance requirements, and the need to disambiguate large, functionally overlapping toolsets. In this paper, we present a comprehensive study of tool retrieval methods for enterprise environments through the development and deployment of ScaleCall, a prototype tool-calling framework within Mastercard designed for orchestrating internal APIs and automating data engineering workflows. We systematically evaluate embedding-based retrieval, prompt-based listwise ranking, and hybrid approaches, revealing that method effectiveness depends heavily on domain-specific factors rather than inherent algorithmic superiority. Through empirical investigation on enterprise-derived benchmarks, we find that embedding-based methods offer superior latency for large tool repositories, while listwise ranking provides better disambiguation for overlapping functionalities, with hybrid approaches showing promise in specific contexts. We integrate our findings into ScaleCall's flexible architecture and validate the framework through real-world deployment in Mastercard's regulated environment. Our work provides practical insights into the trade-offs between retrieval accuracy, computational efficiency, and operational requirements, contributing to the understanding of tool-calling system design for enterprise applications in regulated industries.
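The abstract contrasts embedding-based retrieval with listwise ranking and mentions hybrids of the two. The following is a minimal sketch of what such a two-stage hybrid could look like, not ScaleCall's actual implementation: the tool catalog, the function names (`embed`, `retrieve`, `rerank`), and the bag-of-words "embedding" are all hypothetical stand-ins chosen so the example runs with the Python standard library alone. In a real system, `embed` would likely call an embedding model and `ranker` would likely be an LLM prompt that sees the whole candidate list.

```python
import math
from collections import Counter

# Hypothetical tool catalog; names and descriptions are invented for illustration.
TOOLS = {
    "get_transaction_status": "look up the status of a card transaction by id",
    "refund_transaction": "issue a refund for a card transaction",
    "list_merchant_accounts": "list merchant accounts for a customer",
    "run_etl_job": "start a data engineering etl pipeline job",
}

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, k=2):
    """Stage 1: return the k tools whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda name: cosine(q, embed(TOOLS[name])), reverse=True)
    return ranked[:k]

def rerank(query, candidates, ranker):
    """Stage 2: listwise re-ranking. The ranker sees all candidates at once and
    returns tool names in preferred order; unknown names are ignored and any
    candidates it omits are appended in their original order."""
    order = ranker(query, [(name, TOOLS[name]) for name in candidates])
    kept = [name for name in order if name in candidates]
    return kept + [name for name in candidates if name not in kept]

def keyword_ranker(query, listed):
    """Placeholder for an LLM ranker: prefer tools whose name shares words with the query."""
    words = set(query.lower().split())
    return sorted((name for name, _ in listed),
                  key=lambda name: len(words & set(name.split("_"))),
                  reverse=True)

if __name__ == "__main__":
    query = "refund a card transaction"
    shortlist = retrieve(query, k=2)
    print(rerank(query, shortlist, keyword_ranker))
```

The first stage keeps the candidate list short and cheap to compute, which is where the abstract reports a latency advantage for embedding-based methods; the second stage is where a listwise ranker could disambiguate tools with overlapping descriptions.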

Related papers

Synthesizing Procedural Memory: Challenges and Architectures in Automated Workflow Generation [0.5599792629509229]
This paper operationalizes the transition of Large Language Models from passive tool-users to active workflow architects. We demonstrate that by enforcing a scientific methodology of hypothesize, probe, and code, agents can autonomously write robust, production-grade code skills.
arXiv Detail & Related papers (2025-12-23T11:33:32Z)
Z-Space: A Multi-Agent Tool Orchestration Framework for Enterprise-Grade LLM Automation [3.518072776386001]
This paper proposes Z-Space, a data-generation-oriented multi-agent collaborative tool invocation framework. The framework has been deployed in the Eleme platform's technical division, serving large-scale test data generation scenarios. Production data demonstrates that the system reduces average token consumption in tool inference by 96.26%.
arXiv Detail & Related papers (2025-11-23T03:59:14Z)
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence [17.644658293987955]
Embodied AI agents are capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. Current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations. We propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes.
arXiv Detail & Related papers (2025-10-23T14:05:55Z)
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [56.40989626804489]
This survey provides the first holistic analysis of Large Language Model-powered software engineering. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair.
arXiv Detail & Related papers (2025-10-10T06:56:50Z)
Rationale-Augmented Retrieval with Constrained LLM Re-Ranking for Task Discovery [4.061135251278187]
Head Start programs utilizing GoEngage face significant challenges when new or rotating staff attempt to locate appropriate Tasks on the platform homepage. These difficulties arise from domain-specific jargon, system-specific nomenclature, and the inherent limitations of lexical search in handling typos and varied word ordering. We propose a pragmatic hybrid semantic search system that combines lightweight typo-tolerant lexical retrieval, embedding-based vector similarity, and constrained large language model (LLM) re-ranking.
arXiv Detail & Related papers (2025-10-01T01:28:59Z)
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs). This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
arXiv Detail & Related papers (2025-08-12T09:45:19Z)
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
A Gym-style framework for systematic reinforcement learning, evaluation, and improvement of autonomous large language model (LLM) agents. MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z)
Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey [58.50944604905037]
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems. This survey provides a structured tutorial on fundamental architectures, enabling technologies, and emerging applications.
arXiv Detail & Related papers (2025-05-03T13:55:38Z)
OntoAligner: A Comprehensive Modular and Robust Python Toolkit for Ontology Alignment [0.4499833362998487]
Ontology Alignment (OA) is fundamental for achieving interoperability across diverse knowledge systems. We present OntoAligner, a comprehensive, modular, and robust Python toolkit for OA.
arXiv Detail & Related papers (2025-03-27T18:28:11Z)
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space. MeCo is fine-tuning-free and incurs minimal cost.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
Advancing Code Coverage: Incorporating Program Analysis with Large Language Models [8.31978033489419]
We propose TELPA, a novel technique to generate tests that can reach hard-to-cover branches. Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms the state-of-the-art SBST and LLM-based techniques.
arXiv Detail & Related papers (2024-04-07T14:08:28Z)
Modular approach to data preprocessing in ALOHA and application to a smart industry use case [0.0]
The paper addresses a modular approach, integrated into the ALOHA tool flow, to support the data preprocessing and transformation pipeline. To demonstrate the effectiveness of the approach, we present some experimental results related to a keyword spotting use case.
arXiv Detail & Related papers (2021-02-02T06:48:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.