LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology
- URL: http://arxiv.org/abs/2509.13978v2
- Date: Tue, 23 Sep 2025 13:31:18 GMT
- Title: LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology
- Authors: Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, Rafael Ferreira da Silva,
- Abstract summary: We introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis.<n>Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries.<n> Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful agent responses.
- Score: 3.470217255779291
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in-depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.
Related papers
- Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them.<n>This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks.<n>We introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z) - FROAV: A Framework for RAG Observation and Agent Verification - Lowering the Barrier to LLM Agent Research [0.5729426778193398]
We present FROAV, an open-source research platform that democratizes Large Language Models (LLMs) agent research.<n>FROAV implements a multi-stage Retrieval-Augmented Generation (RAG) pipeline and a rigorous "LLM-as-a-Judge" evaluation system.<n>Our framework integrates n8n for no-code workflow design, FastAPI for flexible backend logic, and Streamlit for human-in-the-loop interaction.
arXiv Detail & Related papers (2026-01-12T13:02:32Z) - Toward Automated and Trustworthy Scientific Analysis and Visualization with LLM-Generated Code [6.068120728706316]
Large language models (LLMs) offer a promising solution by generating code from natural language descriptions.<n>We construct a benchmark suite of domain-inspired prompts that reflect real-world research tasks.<n>Our findings show that, without human intervention, the reliability of LLM-generated code is limited.
arXiv Detail & Related papers (2025-11-26T21:27:03Z) - Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery.<n>Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs) face key limitations.<n>Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z) - LLM/Agent-as-Data-Analyst: A Survey [54.08761322298559]
Large language models (LLMs) and agent techniques have brought a fundamental shift in the functionality and development paradigm of data analysis tasks.<n>LLMs enable complex data understanding, natural language, semantic analysis functions, and autonomous pipeline orchestration.
arXiv Detail & Related papers (2025-09-28T17:31:38Z) - From Parameters to Performance: A Data-Driven Study on LLM Structure and Development [73.67759647072519]
Large language models (LLMs) have achieved remarkable success across various domains.<n>Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce.<n>We present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks.
arXiv Detail & Related papers (2025-09-14T12:20:39Z) - Leveraging Knowledge Graphs and LLM Reasoning to Identify Operational Bottlenecks for Warehouse Planning Assistance [1.2749527861829046]
Our framework integrates Knowledge Graphs (KGs) and Large Language Model (LLM)-based agents.<n>It transforms raw DES data into a semantically rich KG, capturing relationships between simulation events and entities.<n>An LLM-based agent uses iterative reasoning, generating interdependent sub-questions. For each sub-question, it creates Cypher queries for KG interaction, extracts information, and self-reflects to correct errors.
arXiv Detail & Related papers (2025-07-23T07:18:55Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents.<n>MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios.<n>Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z) - AI-Driven Scholarly Peer Review via Persistent Workflow Prompting, Meta-Prompting, and Meta-Reasoning [0.0]
This report introduces Persistent Prompting (PWP), a potentially broadly applicable prompt engineering methodology.<n>We present a proof-of-concept PWP prompt for the critical analysis of experimental chemistry manuscripts.<n>We develop this PWP prompt through iterative application of meta-prompting techniques and meta-reasoning aimed at systematically codifying expert review.
arXiv Detail & Related papers (2025-05-06T09:06:18Z) - AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science [8.281093505963158]
Large language models (LLMs) are increasingly used to automate data analysis through executable code generation.<n>We present AIRepr, an Analyst-Inspector framework for automatically evaluating and improving the of LLM-generated data analysis.
arXiv Detail & Related papers (2025-02-23T01:15:50Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.