Related papers: Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis

Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis

URL: http://arxiv.org/abs/2602.16131v1
Date: Wed, 18 Feb 2026 01:49:35 GMT
Title: Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis
Authors: Chihiro Watanabe, Jingyu Sun,
Abstract summary: We propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers.<n>Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions.
Score: 3.8908016393731533
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs using their distances and the $k$-medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.

Related papers

Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models [4.155649113742267]
ReQUESTA is a hybrid, multi-agent framework for generating cognitively diverse multiple-choice questions (MCQs)<n>We evaluated the framework in a large-scale reading comprehension study using academic expository texts.<n>Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance.
arXiv Detail & Related papers (2026-02-03T16:26:47Z)
JAF: Judge Agent Forest [8.150475950851359]
We introduce JAF: Judge Agent Forest, a framework in which the judge agent conducts joint inference across a cohort of query--response pairs.<n>We develop a flexible locality-sensitive hashing algorithm that learns informative binary codes by integrating semantic embeddings.<n>We validate JAF with an empirical study on the demanding task of cloud misconfigs triage in large-scale cloud environments.
arXiv Detail & Related papers (2026-01-29T19:42:42Z)
Stochastic Self-Organization in Multi-Agent Systems [28.70691568233268]
Multi-agent systems (MAS) based on Large Language Models (LLMs) have the potential to solve tasks that are beyond the reach of any single LLM.<n>We introduce a response-conditioned framework that adapts communication on-the-fly.
arXiv Detail & Related papers (2025-10-01T09:08:04Z)
HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets [9.465853880657487]
Large-scale single-cell and Perturb-seq investigations routinely involve clustering cells.<n>We present HYPOGENEAGENT, a large language model (LLM)-driven framework, transforming cluster annotation into a quantitatively optimizable task.
arXiv Detail & Related papers (2025-09-10T22:25:33Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions [52.33835101586687]
We study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it.<n>We propose a guided hallucination-based approach ELOQ to automatically generate a diverse set of out-of-scope questions from post-cutoff documents.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making [1.3812010983144802]
This study evaluates large language models (LLMs) across diverse domains, including cybersecurity, medicine, and finance. The results indicate that model size and types of prompts used for inference significantly influenced response length and quality.
arXiv Detail & Related papers (2024-06-25T20:52:31Z)
Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment. Multiple LLM-based agents independently explore and then answer queries about a household environment. We analyze different aggregation methods to generate a single, final answer for each query.
arXiv Detail & Related papers (2024-06-16T12:46:40Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context. In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning. Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models. In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization [55.60306377044225]
State-of-the-art summarization systems can generate highly fluent summaries. These summaries, however, may contain factual inconsistencies and/or information not present in the source. We introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and summary is directly compared.
arXiv Detail & Related papers (2023-01-28T23:08:25Z)
Query Focused Multi-Document Summarization with Distant Supervision [88.39032981994535]
Existing work relies heavily on retrieval-style methods for estimating the relevance between queries and text segments. We propose a coarse-to-fine modeling framework which introduces separate modules for estimating whether segments are relevant to the query. We demonstrate that our framework outperforms strong comparison systems on standard QFS benchmarks.
arXiv Detail & Related papers (2020-04-06T22:35:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.