Related papers: JAF: Judge Agent Forest

JAF: Judge Agent Forest

URL: http://arxiv.org/abs/2601.22269v1
Date: Thu, 29 Jan 2026 19:42:42 GMT
Title: JAF: Judge Agent Forest
Authors: Sahil Garg, Brad Cheezum, Sridhar Dutta, Vishal Agarwal,
Abstract summary: We introduce JAF: Judge Agent Forest, a framework in which the judge agent conducts joint inference across a cohort of query--response pairs.<n>We develop a flexible locality-sensitive hashing algorithm that learns informative binary codes by integrating semantic embeddings.<n>We validate JAF with an empirical study on the demanding task of cloud misconfigs triage in large-scale cloud environments.
Score: 8.150475950851359
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Judge agents are fundamental to agentic AI frameworks: they provide automated evaluation, and enable iterative self-refinement of reasoning processes. We introduce JAF: Judge Agent Forest, a framework in which the judge agent conducts joint inference across a cohort of query--response pairs generated by a primary agent, rather than evaluating each in isolation. This paradigm elevates the judge from a local evaluator to a holistic learner: by simultaneously assessing related responses, the judge discerns cross-instance patterns and inconsistencies, whose aggregate feedback enables the primary agent to improve by viewing its own outputs through the judge's collective perspective. Conceptually, JAF bridges belief propagation and ensemble-learning principles: overlapping in-context neighborhoods induce a knowledge-graph structure that facilitates propagation of critique, and repeated, randomized evaluations yield a robust ensemble of context-sensitive judgments. JAF can be instantiated entirely via ICL, with the judge prompted for each query using its associated primary-agent response plus a small, possibly noisy set of peer exemplars. While kNN in embedding space is a natural starting point for exemplars, this approach overlooks categorical structure, domain metadata, or nuanced distinctions accessible to modern LLMs. To overcome these limitations, we develop a flexible locality-sensitive hashing (LSH) algorithm that learns informative binary codes by integrating semantic embeddings, LLM-driven hash predicates, supervision from categorical labels, and relevant side information. These hash codes support efficient, interpretable, and relation-aware selection of diverse exemplars, and further optimize exploration of CoT reasoning paths. We validate JAF with an empirical study on the demanding task of cloud misconfigs triage in large-scale cloud environments.

Related papers

LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval [74.72139580745511]
LaSER is a novel self-distillation framework that internalizes explicit reasoning into the latent space of retrievers.<n>Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
arXiv Detail & Related papers (2026-03-02T04:11:18Z)
Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis [3.8908016393731533]
We propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers.<n>Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions.
arXiv Detail & Related papers (2026-02-18T01:49:35Z)
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance.<n>We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state.<n>We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence [31.435770434219005]
Judge Decoding accelerates inference by relaxing the strict verification of Speculative Decoding.<n>In this work, we revisit this paradigm from first principles, revealing that the criticality'' scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence.
arXiv Detail & Related papers (2026-01-08T09:34:54Z)
FeClustRE: Hierarchical Clustering and Semantic Tagging of App Features from User Reviews [0.0]
FeClustRE is a framework integrating hybrid feature extraction, hierarchical clustering with auto-tuning and semantic labelling.<n>We evaluate FeClustRE on public benchmarks for extraction correctness and on a sample study of generative AI assistant app reviews for clustering quality, semantic coherence, and interpretability.
arXiv Detail & Related papers (2025-10-21T16:54:21Z)
LLM-guided Hierarchical Retrieval [54.73080745446999]
LATTICE is a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity.<n>A central challenge in such LLM-guided search is that the model's relevance judgments are noisy, context-dependent, and unaware of the hierarchy.<n>Our framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark.
arXiv Detail & Related papers (2025-10-15T07:05:17Z)
Automated Skill Decomposition Meets Expert Ontologies: Bridging the Granularity Gap with LLMs [1.2891210250935148]
This paper investigates automated skill decomposition using Large Language Models (LLMs)<n>Our framework standardizes the pipeline from prompting and generation to normalization and alignment with ontology nodes.<n>To evaluate outputs, we introduce two metrics: a F1-score that uses optimal embedding-based matching to assess content accuracy, and a hierarchy-aware F1-score that credits structurally correct placements to assess granularity.
arXiv Detail & Related papers (2025-10-13T12:03:06Z)
Towards Open-World Retrieval-Augmented Generation on Knowledge Graph: A Multi-Agent Collaboration Framework [21.896955284099334]
Large Language Models (LLMs) have demonstrated strong capabilities in language understanding and reasoning.<n>Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge sources.<n>We propose AnchorRAG, a novel multi-agent collaboration framework for open-world RAG without the predefined anchor entities.
arXiv Detail & Related papers (2025-09-01T08:26:12Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [54.85131761693927]
We introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions.<n>Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards.<n>We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance.
arXiv Detail & Related papers (2025-05-15T14:05:15Z)
JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains [19.492393243160244]
Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains.<n>Existing evaluations for vertical domains typically rely on the labor-intensive construction of static single-turn datasets.<n>We propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains.
arXiv Detail & Related papers (2024-10-15T11:20:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.