GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems
- URL: http://arxiv.org/abs/2507.13190v1
- Date: Thu, 17 Jul 2025 14:59:20 GMT
- Title: GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems
- Authors: Jisoo Lee, Raeyoung Chang, Dongwook Kwon, Harmanpreet Singh, Nikhil Verma
- Abstract summary: We introduce GEMMAS, a graph-based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. To capture collaboration quality, we propose two process-level metrics: Information Diversity Score (IDS) to measure semantic variation in inter-agent messages, and Unnecessary Path Ratio (UPR) to quantify redundant reasoning paths. We evaluate GEMMAS across five benchmarks and highlight results on GSM8K, where systems with only a 2.1% difference in accuracy differ by 12.8% in IDS and 80% in UPR, revealing substantial variation in internal collaboration.
- Score: 1.7825757481227436
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multi-agent systems built on language models have shown strong performance on collaborative reasoning tasks. However, existing evaluations focus only on the correctness of the final output, overlooking how inefficient communication and poor coordination contribute to redundant reasoning and higher computational costs. We introduce GEMMAS, a graph-based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. To capture collaboration quality, we propose two process-level metrics: Information Diversity Score (IDS) to measure semantic variation in inter-agent messages, and Unnecessary Path Ratio (UPR) to quantify redundant reasoning paths. We evaluate GEMMAS across five benchmarks and highlight results on GSM8K, where systems with only a 2.1% difference in accuracy differ by 12.8% in IDS and 80% in UPR, revealing substantial variation in internal collaboration. These findings demonstrate that outcome-only metrics are insufficient for evaluating multi-agent performance and highlight the importance of process-level diagnostics in designing more interpretable and resource-efficient collaborative AI systems.
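The abstract names the two process-level metrics but not their exact formulas. A minimal illustrative sketch, assuming IDS is the mean pairwise semantic distance between message embeddings and UPR is the fraction of DAG nodes that never reach the final answer node (both definitions are this sketch's assumptions, not the paper's):

```python
# Illustrative GEMMAS-style process metrics; the exact definitions in the
# paper may differ. Embeddings and the interaction DAG are assumed inputs.
from itertools import combinations
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def information_diversity_score(message_embeddings):
    """Mean pairwise cosine distance between inter-agent message
    embeddings; higher means more semantic variation."""
    pairs = list(combinations(message_embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)

def unnecessary_path_ratio(dag, final_node):
    """Fraction of nodes in the interaction DAG that lie on no path
    reaching the final answer node, i.e. redundant reasoning."""
    # dag maps each node to its list of successors.
    parents = {}
    for n, succs in dag.items():
        for s in succs:
            parents.setdefault(s, []).append(n)
    # Backward reachability from the final answer node.
    useful, stack = {final_node}, [final_node]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in useful:
                useful.add(p)
                stack.append(p)
    all_nodes = set(dag) | {s for succs in dag.values() for s in succs}
    return 1.0 - len(useful) / len(all_nodes)
```

For example, in a DAG where agent `c` receives a message but contributes to no path ending at the answer node, `unnecessary_path_ratio` counts `c` toward the redundant fraction.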
Related papers
- A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition [14.242680363313148]
Whole-body biometric recognition is a challenging task that integrates various biometric modalities. We present Quality-guided Mixture of score-fusion Experts (QME). We introduce a novel pseudo-quality loss for quality estimation with a modality-specific Quality Estimator (QE) and a score triplet loss to improve the metric performance.
arXiv Detail & Related papers (2025-07-31T18:00:01Z) - The Optimization Paradox in Clinical AI Multi-Agent Systems [13.177792688650971]
The relationship between component-level optimization and system-wide performance remains poorly understood. We evaluated this relationship using 2,400 real patient cases from the MIMIC-CDM dataset. Our results reveal a paradox: while multi-agent systems generally outperformed single agents, the component-optimized "Best of Breed" system, despite superior components and excellent process metrics, significantly underperformed in diagnostic accuracy (67.7% vs. 77.4% for a top multi-agent system).
arXiv Detail & Related papers (2025-06-06T23:01:51Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems, which autonomously plan and act, are becoming widespread, yet their task success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly incorporates feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents [59.825725526176655]
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents. Existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. We introduce MultiAgentBench, a benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios.
arXiv Detail & Related papers (2025-03-03T05:18:50Z) - A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition [71.61103962200666]
Zero-shot named entity recognition (NER) aims to develop entity recognition systems from unannotated text corpora. Recent work has adapted large language models (LLMs) for zero-shot NER by crafting specialized prompt templates. We introduce the cooperative multi-agent system (CMAS), a novel framework for zero-shot NER.
arXiv Detail & Related papers (2025-02-25T23:30:43Z) - GCM-Net: Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network for Video Sentiment and Emotion Analysis [2.012311338995539]
This paper presents a novel framework that leverages the multi-modal contextual information from utterances and applies metaheuristic algorithms to learn for utterance-level sentiment and emotion prediction.
To show the effectiveness of our approach, we have conducted extensive evaluations on three prominent multimodal benchmark datasets.
arXiv Detail & Related papers (2024-10-02T10:07:48Z) - Enhancing Heterogeneous Multi-Agent Cooperation in Decentralized MARL via GNN-driven Intrinsic Rewards [1.179778723980276]
Multi-agent Reinforcement Learning (MARL) is emerging as a key framework for sequential decision-making and control tasks.
The deployment of these systems in real-world scenarios often requires decentralized training, a diverse set of agents, and learning from infrequent environmental reward signals.
We propose the CoHet algorithm, which utilizes a novel Graph Neural Network (GNN) based intrinsic motivation to facilitate the learning of heterogeneous agent policies.
arXiv Detail & Related papers (2024-08-12T21:38:40Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited to system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Open-Domain Text Evaluation via Contrastive Distribution Methods [75.59039812868681]
We introduce Contrastive Distribution Methods (CDM), a novel approach to evaluating open-domain text generation.
Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate CDM's superior correlation with human judgment.
arXiv Detail & Related papers (2023-06-20T20:37:54Z) - Quality-Based Conditional Processing in Multi-Biometrics: Application to Sensor Interoperability [63.05238390013457]
We describe and evaluate the ATVS-UAM fusion approach submitted to the quality-based evaluation of the 2007 BioSecure Multimodal Evaluation Campaign.
Our approach is based on linear logistic regression, in which fused scores tend to be log-likelihood-ratios.
Results show that the proposed approach outperforms all the rule-based fusion schemes.
arXiv Detail & Related papers (2022-11-24T12:11:22Z) - Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z) - Graph Convolutional Value Decomposition in Multi-Agent Reinforcement Learning [9.774412108791218]
We propose a novel framework for value function factorization in deep reinforcement learning.
In particular, we consider the team of agents as the set of nodes of a complete directed graph.
We introduce a mixing GNN module, which is responsible for i) factorizing the team state-action value function into individual per-agent observation-action value functions, and ii) explicit credit assignment to each agent in terms of fractions of the global team reward.
arXiv Detail & Related papers (2020-10-09T18:01:01Z) - Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as Dif-MAML.
We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML objective.
Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.