JudgeFlow: Agentic Workflow Optimization via Block Judge
- URL: http://arxiv.org/abs/2601.07477v1
- Date: Mon, 12 Jan 2026 12:30:14 GMT
- Title: JudgeFlow: Agentic Workflow Optimization via Block Judge
- Authors: Zihan Ma, Zhikai Zhao, Chuanbo Hua, Federico Berto, Jinkyoo Park,
- Abstract summary: Current methods rely on coarse, end-to-end evaluation signals and lack fine-grained guidance on where to refine, often resulting in inefficient or low-impact modifications. We propose JudgeFlow, an Evaluation-Judge-Optimization-Update pipeline that captures fundamental forms of logic and assigns rank-based responsibility scores to problematic blocks. Our approach improves sample efficiency, enhances interpretability through block-level diagnostics, and provides a scalable foundation for automating increasingly complex agentic workflows.
- Score: 25.427646436735312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimizing LLM-based agentic workflows is a central challenge in scaling AI capabilities. Current methods rely on coarse, end-to-end evaluation signals and lack fine-grained guidance on where to refine, often resulting in inefficient or low-impact modifications. To address these limitations, we propose JudgeFlow, an Evaluation-Judge-Optimization-Update pipeline. We incorporate reusable, configurable logic blocks into agentic workflows to capture fundamental forms of logic. On top of this abstraction, we design a dedicated Judge module that inspects execution traces -- particularly failed runs -- and assigns rank-based responsibility scores to problematic blocks. These fine-grained diagnostic signals are then leveraged by an LLM-based optimizer, which focuses modifications on the most problematic block in the workflow. Our approach improves sample efficiency, enhances interpretability through block-level diagnostics, and provides a scalable foundation for automating increasingly complex agentic workflows. We evaluate JudgeFlow on mathematical reasoning and code generation benchmarks, where it achieves superior performance and efficiency compared to existing methods. The source code is publicly available at https://github.com/ma-zihan/JudgeFlow.
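The rank-based responsibility scoring described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the per-trace ranking format, and the linear rank-to-score weighting are hypothetical, since the paper does not spell out its exact formula in the abstract. The idea is only that a judge ranks blocks by suspected fault in each failed run, and the optimizer targets the block with the highest aggregate score.

```python
from collections import defaultdict

def responsibility_scores(failed_rankings):
    """Aggregate rank-based responsibility scores across failed runs.

    Each element is a list of block names ordered from most to least
    suspicious for one failed trace (in practice an LLM judge would
    produce this ranking from the execution trace).
    """
    scores = defaultdict(float)
    for ranking in failed_rankings:
        n = len(ranking)
        for rank, block in enumerate(ranking):
            # Linearly higher credit for blocks judged more responsible.
            scores[block] += (n - rank) / n
    return dict(scores)

def most_problematic_block(failed_rankings):
    """Pick the block the optimizer should modify first."""
    scores = responsibility_scores(failed_rankings)
    return max(scores, key=scores.get)

# Three failed runs over three hypothetical logic blocks.
rankings = [
    ["plan", "solve", "verify"],
    ["solve", "plan", "verify"],
    ["plan", "verify", "solve"],
]
print(most_problematic_block(rankings))  # "plan" accumulates the top score
```

Aggregating over many failed traces is what makes the signal fine-grained: a single run's ranking may be noisy, but a block that is consistently ranked near the top is a strong refinement target.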
Related papers
- Sherlock: Reliable and Efficient Agentic Workflow Execution [44.30588192569476]
Large language models (LLMs) are increasingly replacing traditional applications. Incorrect or partially correct output at one step can propagate or even amplify through subsequent stages, yet verifying every step introduces significant latency and cost overheads. Our solution, Sherlock, addresses these challenges using counterfactual analysis on agentic workflows to identify error-prone nodes and selectively attach cost-optimal verifiers.
arXiv Detail & Related papers (2025-11-01T00:17:57Z) - DyFlow: Dynamic Workflow Framework for Agentic Reasoning [79.19799197382478]
DyFlow is a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time intermediate feedback. We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation. Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains.
arXiv Detail & Related papers (2025-09-30T10:36:23Z) - Blueprint First, Model Second: A Framework for Deterministic LLM Workflow [3.9886771197662925]
We introduce the Source Code Agent framework, a new paradigm built on the "Blueprint First, Model Second" philosophy. Our framework decouples the workflow logic from the generative model. Our work enables the verifiable and reliable deployment of autonomous agents in applications governed by strict procedural logic.
arXiv Detail & Related papers (2025-08-01T03:10:00Z) - GNNs as Predictors of Agentic Workflow Performances [48.34485750450876]
Agentic workflows invoked by Large Language Models (LLMs) have achieved remarkable success in handling complex tasks. This paper formulates agentic workflows as computational graphs and advocates Graph Neural Networks (GNNs) as efficient predictors of agentic workflow performances. We construct FLORA-Bench, a unified platform for benchmarking GNNs for predicting agentic workflow performances.
arXiv Detail & Related papers (2025-03-14T11:11:00Z) - ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization [51.280919773837645]
We develop ScoreFlow, a high-performance framework for agent workflow optimization. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. It achieves an 8.2% improvement over existing baselines across question answering, coding, and mathematical reasoning.
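One way to read "a DPO variant that accounts for quantitative feedback" is to weight each preference pair by its measured score gap. The sketch below is an assumed form for illustration only, not ScoreFlow's actual objective: the function names and the multiplicative score-gap weighting are hypothetical.

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss on one (winner, loser) pair of log-probs
    under the policy and a frozen reference model."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def score_dpo_loss(logp_w, logp_l, ref_w, ref_l,
                   score_w, score_l, beta=0.1):
    # Assumed variant: scale the pairwise loss by the quantitative
    # score gap, so pairs with a larger measured quality difference
    # pull harder on the policy than near-ties.
    return (score_w - score_l) * dpo_loss(logp_w, logp_l,
                                          ref_w, ref_l, beta)
```

Under this weighting a pair whose two workflows scored identically contributes nothing, which is the behavior one would want from score-aware preference optimization.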
arXiv Detail & Related papers (2025-02-06T18:47:49Z) - Flow: Modularized Agentic Workflow Automation [53.073598156915615]
Multi-agent frameworks powered by large language models (LLMs) have demonstrated great success in automated planning and task execution. However, the effective adjustment of agentic workflows during execution has not been well studied. In this paper, we define an activity-on-vertex (AOV) graph, which allows continuous workflow refinement by agents. Our proposed multi-agent framework achieves efficient concurrent execution of subtasks, effective goal achievement, and enhanced error tolerance.
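An activity-on-vertex graph puts subtasks on vertices and dependencies on edges, which makes concurrent execution a matter of topological batching. A minimal sketch with a hypothetical five-task workflow (the task names are invented for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical AOV workflow: each key is a subtask, each value the set
# of subtasks it depends on.
workflow = {
    "draft": set(),
    "research": set(),
    "outline": {"research"},
    "write": {"draft", "outline"},
    "review": {"write"},
}

ts = TopologicalSorter(workflow)
ts.prepare()
schedule = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # subtasks whose deps are all done
    schedule.append(ready)          # this batch can run concurrently
    ts.done(*ready)

print(schedule)
# [['draft', 'research'], ['outline'], ['write'], ['review']]
```

Because the graph is explicit data, agents can refine it mid-execution -- adding, removing, or rewiring pending vertices -- and the scheduler simply re-derives the ready set, which is one way to realize the continuous refinement the summary describes.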
arXiv Detail & Related papers (2025-01-14T04:35:37Z) - AFlow: Automating Agentic Workflow Generation [36.61172223528231]
Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains. We introduce AFlow, an automated framework that efficiently explores this space using Monte Carlo Tree Search. Empirical evaluations across six benchmark datasets demonstrate AFlow's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2024-10-14T17:40:40Z) - Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z) - Self Normalizing Flows [65.73510214694987]
We propose a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer.
This reduces the computational complexity of each layer's exact update from $\mathcal{O}(D^3)$ to $\mathcal{O}(D^2)$.
We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts.
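The complexity claim can be made concrete: for a linear flow layer with weight $W$, the gradient of $\log|\det W|$ is $(W^{-1})^\top$, and the inversion costs $\mathcal{O}(D^3)$. If a layer instead maintains a learned approximate inverse $R \approx W^{-1}$, its transpose can stand in for that gradient term with no inversion at update time. The sketch below only checks this substitution numerically; the variable names and the way $R$ is obtained (here, the true inverse plus noise rather than training) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
W = rng.standard_normal((D, D)) / np.sqrt(D)  # flow layer weight

# Exact gradient of log|det W| w.r.t. W -- requires an O(D^3) inversion.
exact_grad = np.linalg.inv(W).T

# Stand-in for a learned approximate inverse (in the paper this is a
# second weight matrix trained with a reconstruction loss; here we just
# perturb the true inverse to mimic imperfect learning).
R = np.linalg.inv(W) + 0.01 * rng.standard_normal((D, D))

# Self-normalizing substitution: use R^T as the gradient term, avoiding
# any inversion or determinant computation during the update.
approx_grad = R.T

print(np.max(np.abs(exact_grad - approx_grad)))  # small when R ≈ inv(W)
```

The remaining per-update work is matrix arithmetic on $D \times D$ arrays, which is where the quoted drop to $\mathcal{O}(D^2)$ per-sample cost comes from.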
arXiv Detail & Related papers (2020-11-14T09:51:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.