Classifier-Augmented Generation for Structured Workflow Prediction
- URL: http://arxiv.org/abs/2510.12825v1
- Date: Fri, 10 Oct 2025 18:38:25 GMT
- Title: Classifier-Augmented Generation for Structured Workflow Prediction
- Authors: Thomas Gschwind, Shramona Chakraborty, Nitin Gupta, Sameep Mehta
- Abstract summary: We propose a system that translates natural language descriptions into executable workflows. It automatically predicts both the structure and detailed configuration of the flow. This is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.
- Score: 5.92079054629498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time-consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency, while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.
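The pipeline described in the abstract (decompose the utterance, classify each sub-utterance into a stage, build a stage-specific few-shot prompt, then connect stages with edges) can be illustrated with a minimal sketch. This is not the authors' implementation: the stage names, keyword cues, few-shot examples, and all function names below are hypothetical, a keyword lookup stands in for the trained classifier, no LLM is called, and edges are assumed to form a simple linear chain, whereas the paper predicts non-linear workflows.

```python
import re
from dataclasses import dataclass

# Hypothetical stage catalogue: keyword cues stand in for a trained classifier.
STAGE_CUES = {
    "source": ("read", "extract", "load from"),
    "filter": ("filter", "keep only"),
    "join": ("join", "merge"),
    "target": ("write", "save", "store"),
}

# Hypothetical few-shot examples, one small set per stage type.
FEW_SHOT = {
    "source": ['"read orders.csv" -> {"path": "orders.csv"}'],
    "filter": ['"keep only rows where qty > 0" -> {"where": "qty > 0"}'],
    "join": ['"join on customer_id" -> {"key": "customer_id"}'],
    "target": ['"write to table sales" -> {"table": "sales"}'],
}


@dataclass
class Stage:
    stage_id: str
    kind: str
    utterance: str


def decompose(utterance: str) -> list[str]:
    """Split an utterance into sub-utterances on commas, semicolons, and 'then'."""
    parts = re.split(r"\s*(?:,|;|\bthen\b)\s*", utterance, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def classify(sub_utterance: str) -> str:
    """Return the first stage type whose cue appears in the sub-utterance."""
    text = sub_utterance.lower()
    for kind, cues in STAGE_CUES.items():
        if any(cue in text for cue in cues):
            return kind
    return "unknown"


def build_prompt(stage: Stage) -> str:
    """Assemble a prompt that carries only the examples for the predicted stage type."""
    examples = "\n".join(FEW_SHOT.get(stage.kind, []))
    return (
        f"Stage type: {stage.kind}\n"
        f"Examples:\n{examples}\n"
        f"Utterance: {stage.utterance}\n"
        "Properties:"
    )


def predict_edges(stages: list[Stage]) -> list[tuple[str, str]]:
    """Assumption: connect stages in order of mention (a linear chain)."""
    return [(a.stage_id, b.stage_id) for a, b in zip(stages, stages[1:])]


def plan(utterance: str) -> tuple[list[Stage], list[tuple[str, str]]]:
    """Run decomposition, classification, and edge prediction end to end."""
    stages = [
        Stage(f"s{i}", classify(sub), sub)
        for i, sub in enumerate(decompose(utterance))
    ]
    return stages, predict_edges(stages)


if __name__ == "__main__":
    stages, edges = plan(
        "Read customers.csv, then filter rows where age > 30, "
        "then write the result to a table"
    )
    for stage in stages:
        print(stage.stage_id, stage.kind, "|", stage.utterance)
    print(edges)
    print(build_prompt(stages[1]))
```

Because each prompt contains only the examples for one predicted stage type, it stays short, which is one plausible reading of how a classifier-first design could reduce token usage relative to a single prompt that covers every stage type.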
Related papers
- Operationalization of Machine Learning with Serverless Architecture: An Industrial Implementation for Harmonized System Code Prediction [0.0]
This paper presents a serverless MLOps framework orchestrating the complete ML lifecycle, from data ingestion, training, deployment, and monitoring to retraining, using event-driven pipelines and managed services. We demonstrate practical applicability through an industrial implementation for Harmonized System (HS) code prediction, a compliance-critical task where short, unstructured product descriptions are mapped to standardized codes used by customs authorities in global trade. Our solution uses a custom text embedding with multiple deep learning architectures, with Text-CNN achieving 98 percent accuracy on ground truth data.
arXiv Detail & Related papers (2026-02-19T05:59:55Z)
- Autonomous Data Processing using Meta-Agents [2.3732259124656907]
We present Autonomous Data Processing using Meta-agents (ADP-MA), a framework that dynamically constructs, executes, and iteratively refines data processing pipelines. ADP-MA emphasizes context-aware optimization, adaptive workload partitioning, and progressive sampling for scalability. We demonstrate ADP-MA through an interactive demo that showcases pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks.
arXiv Detail & Related papers (2026-01-30T20:58:17Z)
- Cluster Workload Allocation: Semantic Soft Affinity Using Natural Language Processing [0.0]
This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a scheduler extender to interpret natural language allocation hint annotations for soft affinity preferences.
arXiv Detail & Related papers (2026-01-14T08:36:21Z)
- Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens [3.6588919376939733]
We propose eliminating external agentic workflows for writing research introductions. Instead, we parameterize their logical structure into a large language model. This allows the generation of a complete introduction in a single inference.
arXiv Detail & Related papers (2025-12-28T12:51:36Z)
- How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis [0.0]
Despite its significance, tokenization in the context of assembly code remains an underexplored area. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. We compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code.
arXiv Detail & Related papers (2025-11-05T19:45:26Z)
- A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning [52.07397258423034]
We propose a new framework to analyze ICL performance in a class of realistic settings. We derive the precise relationship between ICL performance, context length, and the KL divergence between the pre-training and query task distributions.
arXiv Detail & Related papers (2025-10-26T09:21:29Z)
- Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
arXiv Detail & Related papers (2025-10-23T07:09:45Z)
- ContextNav: Towards Agentic Multimodal In-Context Learning [85.05420047017513]
ContextNav is an agentic framework that integrates the scalability of automated retrieval with the quality and adaptiveness of human-like curation. It builds a resource-aware multimodal embedding pipeline, maintains a retrievable vector database, and applies agentic retrieval and structural alignment to construct noise-resilient contexts. Experimental results demonstrate that ContextNav achieves state-of-the-art performance across various datasets.
arXiv Detail & Related papers (2025-10-06T07:49:52Z)
- AgenticIE: An Adaptive Agent for Information Extraction from Complex Regulatory Documents [1.338174941551702]
Declaration of Performance (DoP) documents, mandated by EU regulation, certify the performance of construction products. Making DoPs machine- and human-accessible poses two challenges: automated key-value pair extraction (KVP) and question answering (QA).
arXiv Detail & Related papers (2025-09-15T10:53:05Z)
- Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text. A key contribution of this work is the introduction of a newly annotated dataset. We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv Detail & Related papers (2025-07-11T07:25:55Z)
- Dynamic Chunking for End-to-End Hierarchical Sequence Modeling [17.277753030570263]
We introduce techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision.
arXiv Detail & Related papers (2025-07-10T17:39:37Z)
- Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding [56.565200973244146]
Agentic Predictor is a lightweight predictor for efficient agentic workflow evaluation. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations.
arXiv Detail & Related papers (2025-05-26T09:46:50Z)
- Fine-tuning a Large Language Model for Automating Computational Fluid Dynamics Simulations [11.902947290205645]
Although large language models (LLMs) have advanced scientific computing, their use in automating CFD workflows remains underdeveloped. We introduce a novel approach centered on domain-specific LLM adaptation. A multi-agent framework orchestrates the process, autonomously verifying inputs, generating configurations, running simulations, and correcting errors.
arXiv Detail & Related papers (2025-04-13T14:35:30Z)
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)