Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevance Assessment for IR Benchmarks
- URL: http://arxiv.org/abs/2602.06526v1
- Date: Fri, 06 Feb 2026 09:27:03 GMT
- Title: Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevance Assessment for IR Benchmarks
- Authors: Minjeong Ban, Jeonghwan Choi, Hyangsuk Min, Nicole Hee-Yeon Kim, Minseok Kim, Jae-Gil Lee, Hwanjun Song
- Abstract summary: DREAM is a multi-round debate-based relevance assessment framework with LLM agents. It achieves 95.2% labeling accuracy with only 3.5% human involvement. BRIDGE is a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison.
- Score: 31.017987800426894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26; and the BRIDGE dataset is available at https://github.com/DISL-Lab/BRIDGE-Benchmark.
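The abstract describes DREAM only at a high level: two LLM agents start from opposing stances, critique each other over multiple rounds, and the pair either converges on a label or escalates the case to a human. The sketch below is a minimal, hypothetical rendering of that agreement-based loop; the prompts, the `llm` callable, and the round limit are assumptions for illustration, not the authors' released implementation (see the linked GitHub repository for that).

```python
# Minimal sketch of an agreement-based multi-agent debate for relevance labeling,
# loosely following the high-level description in the abstract above. The prompts,
# the round limit, and the `llm` callable are hypothetical, not the authors' code.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    relevant: bool
    rationale: str

def debate_turn(stance: str, query: str, chunk: str,
                opponent_rationale: Optional[str],
                llm: Callable[[str], str]) -> Verdict:
    """One turn: the agent argues from its assigned stance and may revise its
    verdict after reading the opponent's latest rationale."""
    prompt = (
        f"Your initial stance: the chunk is {stance} to the query.\n"
        f"Query: {query}\nChunk: {chunk}\n"
        + (f"Opponent's argument: {opponent_rationale}\n" if opponent_rationale else "")
        + "Answer RELEVANT or NOT_RELEVANT, then give a one-sentence rationale."
    )
    reply = llm(prompt)
    return Verdict(relevant=reply.strip().upper().startswith("RELEVANT"), rationale=reply)

def debate_label(query: str, chunk: str, llm: Callable[[str], str], max_rounds: int = 3) -> dict:
    """Run two agents with opposing initial stances; accept the label when they
    agree, otherwise escalate the (query, chunk) pair to a human annotator."""
    con_rationale = None
    for _ in range(max_rounds):
        pro = debate_turn("relevant", query, chunk, con_rationale, llm)
        con = debate_turn("not relevant", query, chunk, pro.rationale, llm)
        con_rationale = con.rationale
        if pro.relevant == con.relevant:            # agreement: take the shared label
            return {"label": pro.relevant, "escalate": False}
    return {"label": None, "escalate": True}        # persistent disagreement: human review
```

Escalating only the pairs on which the agents never agree is what would keep human involvement low (the abstract reports 3.5%) while reserving expert effort for the genuinely uncertain cases.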
Related papers
- Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study [0.0]
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models. We propose a psychometric framework to quantify and mitigate socially desirable responding (SDR) in questionnaire-based evaluation of large language models.
arXiv Detail & Related papers (2026-02-19T11:07:24Z) - With Argus Eyes: Assessing Retrieval Gaps via Uncertainty Scoring to Detect and Remedy Retrieval Blind Spots [10.538640148641532]
We show that neural retrievers have blind spots, which we define as the failure to retrieve entities that are relevant to the query but have low similarity to the query embedding. We investigate the training-induced biases that cause such blind-spot entities to be mapped to inaccessible parts of the embedding space, resulting in low retrievability. We introduce ARGUS, a pipeline that enables the retrievability of high-risk (low-RPS) entities through targeted document augmentation from a knowledge base (KB), in our case the first paragraphs of Wikipedia.
arXiv Detail & Related papers (2026-02-10T10:04:55Z) - Redefining Retrieval Evaluation in the Era of LLMs [20.75884808285362]
Traditional Information Retrieval (IR) metrics assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval-Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs). We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones.
arXiv Detail & Related papers (2025-10-24T13:17:00Z) - BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent [74.10138164281618]
BrowseComp-Plus is a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods.
arXiv Detail & Related papers (2025-08-08T17:55:11Z) - What Has Been Lost with Synthetic Evaluation? [45.678729819785104]
Large language models (LLMs) are increasingly used for data generation. We investigate whether LLMs can meet the demands of benchmark creation by generating reasoning-over-text benchmarks. We show that the generated benchmarks are less challenging for LLMs than their human-authored counterparts.
arXiv Detail & Related papers (2025-05-28T20:12:32Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals [5.605770511387228]
RAGuard is the first benchmark to evaluate the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our fact-checking dataset captures naturally occurring misinformation.
arXiv Detail & Related papers (2025-02-22T05:50:15Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark, MR-Ben, that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks.
arXiv Detail & Related papers (2024-06-03T05:47:05Z) - Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z) - Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
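Returning to the abstract's claim that unaddressed annotation holes distort retriever rankings, the toy comparison below shows the mechanism with hypothetical document ids and recall@5; the numbers are illustrative only and are not taken from the paper or from BRIDGE.

```python
# Toy illustration (hypothetical ids and numbers, not from the paper) of how
# unjudged relevant chunks ("holes") can flip a retriever comparison once the
# missing relevance labels are filled in.
def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant ids that appear in the top-k of the ranking."""
    return sum(1 for d in ranked[:k] if d in relevant) / max(len(relevant), 1)

run_a = ["d7", "d9", "d1", "d3", "d8"]   # surfaces two unlabeled-but-relevant chunks
run_b = ["d1", "d4", "d5", "d6", "d2"]   # sticks to the originally labeled chunks

qrels_incomplete = {"d1", "d4"}                 # benchmark with missing annotations
qrels_completed = {"d1", "d4", "d7", "d9"}      # after the holes are filled in

for name, run in [("A", run_a), ("B", run_b)]:
    print(name,
          recall_at_k(run, qrels_incomplete),   # A: 0.50, B: 1.00 -> B looks better
          recall_at_k(run, qrels_completed))    # A: 0.75, B: 0.50 -> A is actually better
```

Under the incomplete judgments retriever B appears to dominate, but once the missing labels are added the ordering reverses; this is the kind of ranking distortion the re-benchmarking described in the abstract is meant to expose.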
This list is automatically generated from the titles and abstracts of the papers on this site.