ChainBuddy: An AI Agent System for Generating LLM Pipelines
- URL: http://arxiv.org/abs/2409.13588v2
- Date: Sat, 08 Feb 2025 21:59:10 GMT
- Title: ChainBuddy: An AI Agent System for Generating LLM Pipelines
- Authors: Jingyue Zhang, Ian Arawjo
- Abstract summary: ChainBuddy is an AI workflow generation assistant built into the ChainForge platform.
From a single prompt or chat, ChainBuddy generates a starter evaluative pipeline in ChainForge aligned to the user's requirements.
We find that when using AI assistance, participants reported a less demanding workload, felt more confident, and produced higher quality pipelines evaluating LLM behavior.
- Abstract: As large language models (LLMs) advance, their potential applications have grown significantly. However, it remains difficult to evaluate LLM behavior on user-defined tasks and craft effective pipelines to do so. Many users struggle with where to start, often referred to as the "blank page problem." ChainBuddy, an AI workflow generation assistant built into the ChainForge platform, aims to tackle this issue. From a single prompt or chat, ChainBuddy generates a starter evaluative LLM pipeline in ChainForge aligned to the user's requirements. ChainBuddy offers a straightforward and user-friendly way to plan and evaluate LLM behavior and make the process less daunting and more accessible across a wide range of possible tasks and use cases. We report a within-subjects user study comparing ChainBuddy to the baseline interface. We find that when using AI assistance, participants reported a less demanding workload, felt more confident, and produced higher quality pipelines evaluating LLM behavior. However, we also uncover a mismatch between subjective and objective ratings of performance: participants rated their successfulness similarly across conditions, while independent experts rated participant workflows significantly higher with AI assistance. Drawing connections to the Dunning-Kruger effect, we draw design implications for the future of workflow generation assistants to mitigate the risk of over-reliance.
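To make concrete what a "starter evaluative pipeline" involves, the sketch below shows the general shape of such a pipeline in plain Python: a prompt template filled from a small set of test inputs, queries to one or more models, and a programmatic evaluator that scores each response. This is an illustrative sketch only; the model names, the `query_model` placeholder, and the pass/fail criterion are assumptions for the example and do not reflect ChainForge's actual node format or ChainBuddy's generated output.

```python
# Illustrative sketch of an evaluative LLM pipeline: prompt template ->
# model queries -> programmatic evaluator. `query_model` is a stand-in
# for a real LLM API call (assumption), not ChainForge's actual interface.
from itertools import product

PROMPT_TEMPLATE = "Summarize the following text in one sentence:\n{text}"

TEST_INPUTS = [
    {"text": "Large language models can draft, critique, and revise text."},
    {"text": "Evaluation pipelines compare model outputs against criteria."},
]

MODELS = ["model-a", "model-b"]  # hypothetical model identifiers


def query_model(model: str, prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a canned response."""
    return f"[{model}] A one-sentence summary of the given text"


def evaluate(response: str) -> bool:
    """Toy evaluator: pass if the response looks like a single sentence."""
    return response.count(".") <= 1 and len(response.split()) <= 40


results = []
for model, case in product(MODELS, TEST_INPUTS):
    prompt = PROMPT_TEMPLATE.format(**case)
    response = query_model(model, prompt)
    results.append({"model": model, "input": case["text"],
                    "response": response, "passed": evaluate(response)})

for row in results:
    print(f"{row['model']}: passed={row['passed']}")
```

In ChainForge itself these stages take the form of connected visual nodes rather than code, which is what makes a generated starter pipeline easy to inspect, rewire, and extend.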
Related papers
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
- CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing [70.25689961697523]
We propose a generalizable algorithm that enhances sequential reasoning by cross-task experience sharing and selection.
Our work bridges the gap between existing sequential reasoning paradigms and validates the effectiveness of leveraging cross-task experiences.
arXiv Detail & Related papers (2024-10-22T03:59:53Z)
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance in less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
- Automated test generation to evaluate tool-augmented LLMs as conversational AI agents [0.27309692684728615]
We present a test generation pipeline to evaluate conversational AI agents.
Our framework uses LLMs to generate diverse tests grounded on user-defined procedures.
Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations.
arXiv Detail & Related papers (2024-09-24T09:57:43Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- FlowMind: Automatic Workflow Generation with LLMs [12.848562107014093]
This paper introduces FlowMind, a novel approach that leverages the capabilities of Large Language Models (LLMs) for automatic workflow generation.
We propose a generic prompt recipe for a "lecture" that helps ground LLM reasoning with reliable Application Programming Interfaces (APIs).
We also introduce NCEN-QA, a new dataset in finance for benchmarking question-answering tasks from N-CEN reports on funds.
arXiv Detail & Related papers (2024-03-17T00:36:37Z)
- AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
- AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts [12.73129785710807]
We introduce the concept of Chaining LLM steps together, where the output of one step becomes the input for the next, thus aggregating the gains per step.
In a 20-person user study, we found that Chaining not only improved the quality of task outcomes, but also significantly enhanced system transparency, controllability, and sense of collaboration.
arXiv Detail & Related papers (2021-10-04T19:59:38Z)
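The chaining pattern described in the AI Chains entry above, where each step's output becomes the next step's input, is closely related to the pipelines ChainBuddy generates. The sketch below is a rough illustration of that pattern, not the paper's system; the placeholder `call_llm` function and the outline-then-draft split are assumptions chosen for the example.

```python
# Minimal sketch of chaining LLM steps: the output of one step becomes the
# input of the next. `call_llm` is a placeholder (assumption), not a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a canned response."""
    return f"<model output for: {prompt[:60]}...>"


def chain(task: str) -> str:
    # Step 1: ask for a short outline of the task.
    outline = call_llm(f"Write a three-point outline for: {task}")
    # Step 2: feed the outline back in and ask for a full draft.
    return call_llm(f"Expand this outline into a short draft:\n{outline}")


if __name__ == "__main__":
    print(chain("an announcement email about a schedule change"))
```

Exposing the intermediate output as an explicit value is what makes each step inspectable and editable, which is consistent with the transparency and controllability gains reported in the AI Chains study.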