Related papers: When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges

URL: http://arxiv.org/abs/2601.08343v1
Date: Tue, 13 Jan 2026 09:02:58 GMT
Title: When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges
Authors: Sichu Liang, Zhenglin Wang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou,
Abstract summary: We show that efficiency gains do not transfer uniformly to judge-centric inference. Across GSM8K, MMLU, and HumanEval, we find that reuse strategies that are effective for execution agents can severely perturb judge behavior.
Score: 26.22728953485589
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-agent LLM systems routinely generate multiple candidate responses that are aggregated by an LLM judge. To reduce the dominant prefill cost in such pipelines, recent work advocates KV cache reuse across partially shared contexts and reports substantial speedups for generation agents. In this work, we show that these efficiency gains do not transfer uniformly to judge-centric inference. Across GSM8K, MMLU, and HumanEval, we find that reuse strategies that are effective for execution agents can severely perturb judge behavior: end-task accuracy may appear stable, yet the judge's selection becomes highly inconsistent with dense prefill. We quantify this risk using Judge Consistency Rate (JCR) and provide diagnostics showing that reuse systematically weakens cross-candidate attention, especially for later candidate blocks. Our ablation further demonstrates that explicit cross-candidate interaction is crucial for preserving dense-prefill decisions. Overall, our results identify a previously overlooked failure mode of KV cache reuse and highlight judge-centric inference as a distinct regime that demands dedicated, risk-aware system design.
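The abstract does not spell out how JCR is computed. A minimal sketch, assuming JCR is the fraction of examples on which the judge selects the same candidate under KV cache reuse as under dense prefill, is given below. The function names, the toy selections, and the sets of correct candidates are all hypothetical and are not taken from the paper; they only illustrate how end-task accuracy can look stable while the judge's selections diverge.

```python
from typing import Sequence, Set


def judge_consistency_rate(dense_choices: Sequence[int],
                           reuse_choices: Sequence[int]) -> float:
    """Fraction of examples where the judge picks the same candidate index
    under KV cache reuse as under dense prefill (assumed JCR definition)."""
    if len(dense_choices) != len(reuse_choices):
        raise ValueError("both runs must cover the same examples")
    if not dense_choices:
        raise ValueError("need at least one example")
    matches = sum(d == r for d, r in zip(dense_choices, reuse_choices))
    return matches / len(dense_choices)


def end_task_accuracy(choices: Sequence[int],
                      correct: Sequence[Set[int]]) -> float:
    """Fraction of examples where the selected candidate is a correct one."""
    return sum(c in ok for c, ok in zip(choices, correct)) / len(choices)


# Hypothetical toy data: index of the candidate selected by the judge.
dense = [0, 1, 2, 0]   # selections with dense prefill
reuse = [1, 1, 0, 2]   # selections with KV cache reuse
# Hypothetical sets of candidates that happen to be correct per example.
correct = [{0, 1}, {1}, {0, 2}, {0, 2}]

acc_dense = end_task_accuracy(dense, correct)
acc_reuse = end_task_accuracy(reuse, correct)
jcr = judge_consistency_rate(dense, reuse)
print(f"accuracy dense={acc_dense:.2f} reuse={acc_reuse:.2f} JCR={jcr:.2f}")
```

In this toy case both runs always land on a correct candidate, so accuracy is identical, yet the two runs agree on only one of four selections. This is the kind of gap that an accuracy-only evaluation would miss.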

Related papers

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction [24.416258744287166]
ICON is a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. ICON achieves a competitive 0.4% ASR, matching commercial-grade detectors, while yielding a task utility gain of over 50%.
arXiv Detail & Related papers (2026-02-24T09:13:05Z)
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
Re-Rankers as Relevance Judges [65.37611299805856]
We reproduce re-rankers in a re-ranker-as-relevance-judge setup. We perform experiments on TREC-DL 2019 to 2023 with 8 re-rankers from 3 families, ranging from 220M to 32B, and analyse the evaluation bias exhibited by re-ranker-based judges.
arXiv Detail & Related papers (2026-01-08T00:02:59Z)
MIRAGE: Misleading Retrieval-Augmented Generation via Black-box and Query-agnostic Poisoning Attacks [47.46936341268548]
Retrieval-Augmented Generation (RAG) systems introduce a critical attack surface: corpus poisoning. We propose MIRAGE, a novel multi-stage poisoning pipeline designed for strict black-box and query-agnostic environments. Extensive experiments demonstrate that MIRAGE significantly outperforms existing baselines in both attack efficacy and stealthiness.
arXiv Detail & Related papers (2025-12-09T06:38:16Z)
Multi-Agent Debate for LLM Judges with Adaptive Stability Detection [46.67172123607961]
We propose a multi-agent debate judge framework where agents collaboratively reason and iteratively refine their responses. We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles. Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.
arXiv Detail & Related papers (2025-10-14T16:30:30Z)
Toward an Unbiased Collective Memory for Efficient LLM-Based Agentic 6G Cross-Domain Management [1.9188126920097714]
This paper introduces a novel framework for proactive cross-domain resource orchestration in 6G RAN-Edge networks. The system comprises specialized RAN (energy efficiency) and Edge (latency assurance) agents that engage in iterative negotiation. Agents interact with a digital twin to test their proposals and leverage a long-term collective memory.
arXiv Detail & Related papers (2025-09-30T12:57:11Z)
Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice. We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
Are We There Yet? A Decision Framework for Replacing Term Based Retrieval with Dense Retrieval Systems [35.77217529138364]
Several dense retrieval (DR) models have demonstrated competitive performance to term-based retrieval. DR projects queries and documents into a dense vector space and retrieves results via (approximate) nearest neighbor search. It is impossible to predict whether DR will become ubiquitous in the future, but one way this is possible is through repeated applications of decision processes.
arXiv Detail & Related papers (2022-06-26T23:16:05Z)
Multi-Expert Adversarial Attack Detection in Person Re-identification Using Context Inconsistency [47.719533482898306]
We propose a Multi-Expert Adversarial Attack Detection (MEAAD) approach to detect malicious attacks on person re-identification (ReID) systems. As the first adversarial attack detection approach for ReID, MEAAD effectively detects various adversarial attacks and achieves high ROC-AUC (over 97.5%).
arXiv Detail & Related papers (2021-08-23T01:59:09Z)
Transferable, Controllable, and Inconspicuous Adversarial Attacks on Person Re-identification With Deep Mis-Ranking [83.48804199140758]
We propose a learning-to-mis-rank formulation to perturb the ranking of the system output. We also perform a black-box attack by developing a novel multi-stage network architecture. Our method can control the number of malicious pixels by using differentiable multi-shot sampling.
arXiv Detail & Related papers (2020-04-08T18:48:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.