Related papers: On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset

On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset

URL: http://arxiv.org/abs/2510.22898v1
Date: Mon, 27 Oct 2025 00:58:48 GMT
Title: On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset
Authors: Vishvesh Bhat, Omkar Ghugarkar, Julian McAuley,
Abstract summary: Generalization across Agentic tool-calling environments remains a key unsolved challenge in developing reliable reasoning systems.<n>We present a framework that augments large language models with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration.
Score: 16.921428284844684
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generalization across Agentic tool-calling environments remains a key unsolved challenge in developing reliable agentic reasoning systems. While large language models (LLMs) demonstrate strong performance on isolated benchmarks, their ability to transfer reasoning strategies and co-ordinate tools across diverse domains is poorly understood. In this work, we conduct a large-scale evaluation of state-of-the-art LLMs on multiple tool-calling benchmarksBFCL v3, TauBench, Tau2Bench, and AceBenchand introduce MAVEN (Math & Physics Adversarial Verification & Evaluation Network), a new out of distribution (OOD) benchmark designed to stress-test multi-step reasoning through explicit verification and adversarial task composition. Our results show that most current models achieve below 50% accuracy on MAVEN, revealing a significant generalization gap across tool-use settings. To address this, we present the CoreThink Agentic Reasoner, a framework that augments LLMs with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration. Without additional training, it generalizes across all benchmarks, achieving state-of-the-art performance with 530% improvements over existing baselines at roughly one-tenth the computational cost.

Related papers

A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms [20.241519889633285]
Large Language Models (LLMs) are increasingly deployed as reasoning systems, where reasoning paradigms play a critical role.<n>We conduct a comprehensive and unified evaluation of reasoning paradigms, spanning direct single-model generation, CoT-augmented single-model reasoning, and representative MAS.<n>We introduce MIMeBench, a new open-ended benchmark that targets two foundational yet underexplored semantic capabilities.
arXiv Detail & Related papers (2026-01-19T17:23:45Z)
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning [103.7657839292775]
ARM-Thinker is an Agentic multimodal Reward Model that autonomously invokes external tools to ground judgments in verifiable evidence.<n>We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy.<n>Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
arXiv Detail & Related papers (2025-12-04T18:59:52Z)
Agentic AI Reasoning for Mobile Edge General Intelligence: Fundamentals, Approaches, and Directions [74.35421055079655]
Large language models (LLMs) have enabled an emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities.<n>Mobile Edge General Intelligence (MEGI) brings real-time, privacy-preserving reasoning to the network edge.<n>We propose a joint optimization framework for efficient LLM reasoning deployment in MEGI.
arXiv Detail & Related papers (2025-09-27T10:53:48Z)
RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment [20.416910591388618]
We introduce RefactorCoderQA, a benchmark designed to evaluate and enhance the performance of Large Language Models (LLMs) across coding tasks.<n>Our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%.
arXiv Detail & Related papers (2025-09-12T17:44:22Z)
TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning [10.267950603662776]
TableMind is a tool-integrated table reasoning agent that autonomously performs multi-turn tool invocation, writes and executes code in a secure sandbox environment for data analysis and precise numerical reasoning.<n>To realize these capabilities, we adopt a two-stage fine-tuning paradigm built on top of a powerful pre-trained language model.
arXiv Detail & Related papers (2025-09-08T02:00:31Z)
How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench [58.114899897566964]
In a multi-turn conversational environment, large language models (LLMs) often struggle with consistent reasoning and adherence to domain-specific policies.<n>We propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules.<n>IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively.
arXiv Detail & Related papers (2025-08-28T15:57:33Z)
MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use [72.53177559476704]
We introduce MCPVerse, a real-world benchmark for evaluating agentic tool use.<n> MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens.<n>We benchmarked the state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale)
arXiv Detail & Related papers (2025-08-22T09:47:53Z)
Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research [32.92036657863354]
Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks.<n>However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison.<n>We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible and abstraction framework that addresses these challenges.
arXiv Detail & Related papers (2025-05-30T08:46:23Z)
General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs)<n>We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains.<n>We train a series of models and evaluate them on a wide range of datasets covering wide domains like physics, chemistry, finance, electronics etc.
arXiv Detail & Related papers (2025-05-20T17:41:33Z)
Computational Reasoning of Large Language Models [51.629694188014064]
We introduce textbfTuring Machine Bench, a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes.<n> TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machine.
arXiv Detail & Related papers (2025-04-29T13:52:47Z)
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents.<n>AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit.<n>This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.