Related papers: Routine: A Structural Planning Framework for LLM Agent System in Enterprise

Routine: A Structural Planning Framework for LLM Agent System in Enterprise

URL: http://arxiv.org/abs/2507.14447v2
Date: Tue, 22 Jul 2025 10:01:32 GMT
Title: Routine: A Structural Planning Framework for LLM Agent System in Enterprise
Authors: Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu,
Abstract summary: The deployment of agent systems in an enterprise environment is often hindered by several challenges.<n>Common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability.<n>This paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing.
Score: 10.989149053905587
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent's execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT-4o from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model's accuracy to 95.5%, approaching GPT-4o's performance. These results highlight Routine's effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.

Related papers

MagicAgent: Towards Generalized Agent Planning [73.21129030631421]
We present textbfMagicAgent, a series of foundation models specifically designed for generalized agent planning.<n>We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks.<n>We show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks.
arXiv Detail & Related papers (2026-02-22T01:39:16Z)
When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems [1.8717456484053328]
Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation.<n>We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems.<n>This work establishes foundational infrastructure for systematic reliability evaluation of tool-augmented AI systems.
arXiv Detail & Related papers (2026-01-22T19:24:21Z)
SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models [10.04930078540686]
SCRIBE is a reinforcement learning framework that intervenes at a novel mid-level abstraction.<n>It achieves state-of-the-art performance across a range of reasoning and tool-use benchmarks.
arXiv Detail & Related papers (2026-01-07T03:49:48Z)
JT-DA: Enhancing Data Analysis with Tool-Integrated Table Reasoning Large Language Models [58.408398005993455]
JT-DA-8B is a specialized large language model designed for complex table reasoning tasks across diverse real-world scenarios.<n>We construct a comprehensive and diverse training corpus with 34 well-defined table reasoning tasks, by aggregating 29 public table QA datasets and 3 million tables.<n> Experimental results show that JT-DA-8B achieves strong performance in various table reasoning tasks.
arXiv Detail & Related papers (2025-12-07T14:29:23Z)
Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures.<n>We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy.<n>Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7$times$ compared to strong prompting and uncertainty-based baselines.
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use [73.72524040856052]
AgentFlow is a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory.<n>Flow-GRPO tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates.<n>AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks.
arXiv Detail & Related papers (2025-10-07T05:32:44Z)
OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks.<n>We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.<n>Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset [16.676904484703]
We introduce NaturalGAIA, a novel benchmark engineered on the principle of Causal Pathways.<n>This paradigm structures complex tasks into a series of verifiable atomic steps, ensuring rigorous, fully automated, and reproducible standard for assessment.<n>We then utilize this dataset to perform Reinforcement FineTuning (RFT) on the Q2.5-VL-7B model.
arXiv Detail & Related papers (2025-08-02T11:53:41Z)
Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks [11.125564622217892]
Large Language Model agents improve by learning from their own successful experiences without human intervention.<n>Our method constructs and refines a database of self-generated trajectories that serve as in-context examples for future tasks.<n>Our trajectory bootstrapping technique demonstrates that agents can autonomously improve through experience, offering a scalable alternative to labor-intensive knowledge engineering.
arXiv Detail & Related papers (2025-05-01T00:48:12Z)
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development.<n>We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents.<n>We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions.<n>Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions.<n>We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
An Innovative Data-Driven and Adaptive Reinforcement Learning Approach for Context-Aware Prescriptive Process Monitoring [3.4437362489150254]
We present a novel framework called Fine-Tuned Offline Reinforcement Learning Augmented Process Sequence Optimization.<n>FORLAPS aims to identify optimal execution paths in business processes by leveraging reinforcement learning enhanced with a state-dependent reward shaping mechanism.<n>We show that FORLAPS achieves 31% savings in resource time spent and a 23% reduction in process time span.
arXiv Detail & Related papers (2025-01-17T20:31:35Z)
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [44.34340798542]
Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities. We propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions.
arXiv Detail & Related papers (2024-08-13T20:52:13Z)
Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z)
Devil's Advocate: Anticipatory Reflection for LLM Agents [53.897557605550325]
Our approach prompts LLM agents to decompose a given task into manageable subtasks. We implement a three-fold introspective intervention:. Anticipatory reflection on potential failures and alternative remedy before action execution. Post-action alignment with subtask objectives and backtracking with remedy to ensure utmost effort in plan execution.
arXiv Detail & Related papers (2024-05-25T19:20:15Z)
DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language models (LLMs) agent and case-based reasoning (CBR) In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle. In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z)
Cumulative Reasoning with Large Language Models [12.267474250936123]
Cumulative Reasoning (CR) is a structured framework that enhances large language models (LLMs) problem-solving.<n>CR orchestrates LLMs in three distinct roles--Proposer, Verifier(s), and Reporter--to systematically decompose tasks, generate and validate intermediate reasoning steps, and compose them into a solution.
arXiv Detail & Related papers (2023-08-08T16:18:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.