FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding
- URL: http://arxiv.org/abs/2602.09336v1
- Date: Tue, 10 Feb 2026 02:09:48 GMT
- Title: FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding
- Authors: Siyuan Huang, Ziyu Wang, Chao Pan, Han Zhao,
- Abstract summary: Existing language models struggle with Standard Operating Procedures ( SOPs)<n>We propose FM SO.P, solving these challenges through two novelties.<n>First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data.<n>Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and scoring, adapting to domains.
- Score: 19.90863270947963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between reasoning capabilities that SOP requires: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves 48.3\% pass rate with our 32B model and 34.3\% with our opensource 7B model, matching Qwen-2.5-72B-Instruct baseline (34.4\%) with 10x fewer parameters.
Related papers
- QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search [19.376785819604923]
QP-OneModel bridges user intent and content supply in SNS search engines.<n>It generates intent descriptions as a novel high-fidelity semantic signal.<n>It also exhibits superior generalization, surpassing a 32B model by 7.60% accuracy on unseen tasks.
arXiv Detail & Related papers (2026-02-10T15:38:17Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, modelic, and task properties.<n>We derive a predictive model using coordination metrics, that cross-validated R2=0, enabling prediction on unseen task domains.<n>We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures [10.868853536476317]
Large language models (LLMs) are widely deployed as domain-specific agents.<n>We propose SOP-Maze, a benchmark constructed from real-world business data.<n>Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze.
arXiv Detail & Related papers (2025-10-10T02:47:53Z) - Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts [75.20929587906228]
Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks.<n>However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts.
arXiv Detail & Related papers (2025-09-27T08:43:34Z) - Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling.<n>It comprises four distinct small-scale agents, with clearly defined roles and effective collaboration.<n>It shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z) - SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass.<n>with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation.<n>We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP)<n>On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z) - SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions.<n>Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions.<n>We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z) - Recognizing and Splitting Conditional Sentences for Automation of
Business Processes Management [2.289790204910258]
We present our system that resolves an end-to-end problem consisting of 1) recognizing conditional sentences from technical documents, 2) finding boundaries to extract conditional and resultant clauses, and 3) categorizing resultant clause as Action or Consequence.
Our best model achieved very promising results of 83.82, 87.84, and 85.75 for extracting Condition, Action, and Consequence clauses.
arXiv Detail & Related papers (2021-04-01T17:53:16Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense
Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.