On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset
- URL: http://arxiv.org/abs/2510.22898v1
- Date: Mon, 27 Oct 2025 00:58:48 GMT
- Title: On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset
- Authors: Vishvesh Bhat, Omkar Ghugarkar, Julian McAuley,
- Abstract summary: Generalization across Agentic tool-calling environments remains a key unsolved challenge in developing reliable reasoning systems.<n>We present a framework that augments large language models with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration.
- Score: 16.921428284844684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generalization across Agentic tool-calling environments remains a key unsolved challenge in developing reliable agentic reasoning systems. While large language models (LLMs) demonstrate strong performance on isolated benchmarks, their ability to transfer reasoning strategies and co-ordinate tools across diverse domains is poorly understood. In this work, we conduct a large-scale evaluation of state-of-the-art LLMs on multiple tool-calling benchmarksBFCL v3, TauBench, Tau2Bench, and AceBenchand introduce MAVEN (Math & Physics Adversarial Verification & Evaluation Network), a new out of distribution (OOD) benchmark designed to stress-test multi-step reasoning through explicit verification and adversarial task composition. Our results show that most current models achieve below 50% accuracy on MAVEN, revealing a significant generalization gap across tool-use settings. To address this, we present the CoreThink Agentic Reasoner, a framework that augments LLMs with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration. Without additional training, it generalizes across all benchmarks, achieving state-of-the-art performance with 530% improvements over existing baselines at roughly one-tenth the computational cost.
Related papers
- A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms [20.241519889633285]
Large Language Models (LLMs) are increasingly deployed as reasoning systems, where reasoning paradigms play a critical role.<n>We conduct a comprehensive and unified evaluation of reasoning paradigms, spanning direct single-model generation, CoT-augmented single-model reasoning, and representative MAS.<n>We introduce MIMeBench, a new open-ended benchmark that targets two foundational yet underexplored semantic capabilities.
arXiv Detail & Related papers (2026-01-19T17:23:45Z) - ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning [103.7657839292775]
ARM-Thinker is an Agentic multimodal Reward Model that autonomously invokes external tools to ground judgments in verifiable evidence.<n>We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy.<n>Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
arXiv Detail & Related papers (2025-12-04T18:59:52Z) - Agentic AI Reasoning for Mobile Edge General Intelligence: Fundamentals, Approaches, and Directions [74.35421055079655]
Large language models (LLMs) have enabled an emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities.<n>Mobile Edge General Intelligence (MEGI) brings real-time, privacy-preserving reasoning to the network edge.<n>We propose a joint optimization framework for efficient LLM reasoning deployment in MEGI.
arXiv Detail & Related papers (2025-09-27T10:53:48Z) - RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment [20.416910591388618]
We introduce RefactorCoderQA, a benchmark designed to evaluate and enhance the performance of Large Language Models (LLMs) across coding tasks.<n>Our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%.
arXiv Detail & Related papers (2025-09-12T17:44:22Z) - TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning [10.267950603662776]
TableMind is a tool-integrated table reasoning agent that autonomously performs multi-turn tool invocation, writes and executes code in a secure sandbox environment for data analysis and precise numerical reasoning.<n>To realize these capabilities, we adopt a two-stage fine-tuning paradigm built on top of a powerful pre-trained language model.
arXiv Detail & Related papers (2025-09-08T02:00:31Z) - How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench [58.114899897566964]
In a multi-turn conversational environment, large language models (LLMs) often struggle with consistent reasoning and adherence to domain-specific policies.<n>We propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules.<n>IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively.
arXiv Detail & Related papers (2025-08-28T15:57:33Z) - MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use [72.53177559476704]
We introduce MCPVerse, a real-world benchmark for evaluating agentic tool use.<n> MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens.<n>We benchmarked the state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale)
arXiv Detail & Related papers (2025-08-22T09:47:53Z) - Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research [32.92036657863354]
Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks.<n>However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison.<n>We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible and abstraction framework that addresses these challenges.
arXiv Detail & Related papers (2025-05-30T08:46:23Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs)<n>We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains.<n>We train a series of models and evaluate them on a wide range of datasets covering wide domains like physics, chemistry, finance, electronics etc.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - Computational Reasoning of Large Language Models [51.629694188014064]
We introduce textbfTuring Machine Bench, a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes.<n> TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machine.
arXiv Detail & Related papers (2025-04-29T13:52:47Z) - AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents.<n>AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit.<n>This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.