Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles
- URL: http://arxiv.org/abs/2602.01590v2
- Date: Tue, 03 Feb 2026 06:51:16 GMT
- Title: Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles
- Authors: Shaohan Wang, Benfeng Xu, Licheng Zhang, Mingxuan Du, Chiwei Zhu, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
- Abstract summary: We introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. We propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments of critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia's strict standards for neutrality, comprehensiveness, and verifiability pose a demanding challenge for DRAs, with GAs representing the pinnacle of those standards. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems demonstrate a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at https://github.com/WangShao2000/Wiki_Live_Challenge
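The abstract describes Wiki Eval only at a high level. As a rough illustration of how such a two-part evaluation could be wired up, the sketch below averages per-criterion judge scores and computes a claim-support ratio. The criterion texts, the `judge_score` hook, and the 0-5 scale are assumptions for illustration, not the paper's actual rubric or implementation.

```python
# A minimal sketch, assuming a hypothetical LLM-judge backend.
from statistics import mean

# Three illustrative stand-ins for the 39 writing-quality criteria.
CRITERIA = [
    "Maintains a neutral point of view",
    "Covers all major aspects of the topic",
    "Supports statements with inline citations",
]

def judge_score(report: str, criterion: str) -> float:
    """Rate one criterion on a 0-5 scale; an LLM judge would be called here."""
    raise NotImplementedError("plug in an LLM-as-judge backend")

def writing_quality(report: str) -> float:
    """One plausible aggregation: the mean of per-criterion judge scores."""
    return mean(judge_score(report, c) for c in CRITERIA)

def verifiability(supported_claims: int, total_claims: int) -> float:
    """Fraction of extracted claims actually backed by their cited sources."""
    return supported_claims / total_claims if total_claims else 0.0
```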
Related papers
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write approaches significantly outperform single-turn generation.
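As a toy illustration of checklist-style scoring, the snippet below computes the fraction of checklist items a report satisfies. The naive substring check and the example items are assumptions; the actual protocol presumably uses an LLM or entailment model to verify each item.

```python
# A minimal sketch of checklist-based report scoring; the substring check
# is an illustrative stand-in for a proper LLM or entailment verifier.
def item_satisfied(report: str, item: str) -> bool:
    """Crude check that a checklist item is reflected in the report."""
    return item.lower() in report.lower()

def checklist_coverage(report: str, checklist: list[str]) -> float:
    """Fraction of checklist items the report satisfies."""
    if not checklist:
        return 0.0
    return sum(item_satisfied(report, i) for i in checklist) / len(checklist)

general_items = ["founded in 1998", "acquired by Example Corp in 2015"]
constraint_items = ["includes a background section"]
report = "Founded in 1998, the survey includes a background section on ..."
print(checklist_coverage(report, general_items))     # factual coverage: 0.5
print(checklist_coverage(report, constraint_items))  # structural: 1.0
```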
arXiv Detail & Related papers (2026-01-07T03:07:52Z)
- CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection
We introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews. We also develop CoCoDet, an AI review detector via a multi-task learning framework, to achieve more accurate and robust detection of AI involvement in review content.
arXiv Detail & Related papers (2025-08-28T06:03:11Z)
- Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models
Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy.
arXiv Detail & Related papers (2025-08-05T19:20:05Z)
- Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection
We introduce a novel, real-world, large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. We propose WikiDYK, which leverages recently added, human-written facts from Wikipedia's "Did You Know..." entries. WikiDYK contains 12,290 facts and 77,180 questions, and can be seamlessly extended with future updates from Wikipedia editors.
arXiv Detail & Related papers (2025-05-18T08:39:05Z)
- WikiBigEdit: Understanding the Limits of Lifelong Knowledge Editing in LLMs
We bridge research on lifelong knowledge editing to real-world edits at a practically relevant scale. We first introduce WikiBigEdit, a large-scale benchmark of real-world Wikidata edits. In its first instance, it includes over 500K question-answer pairs for knowledge editing.
arXiv Detail & Related papers (2025-03-07T18:45:42Z)
- HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
HelloFresh is based on continuous streams of real-world data generated by intrinsically motivated human labelers.
It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages.
It mitigates the risk of test data contamination and benchmark overfitting.
arXiv Detail & Related papers (2024-06-05T16:25:57Z)
- WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario
WIKIGENBENCH is a new benchmark consisting of 1,320 entries. For generation, we explore a real-world scenario where structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess the verifiability, organization, and other aspects aligned with real-world scenarios.
arXiv Detail & Related papers (2024-02-28T11:51:56Z)
- Longitudinal Assessment of Reference Quality on Wikipedia
This work analyzes the reliability of Wikipedia through the lens of its references.
We operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references; a toy computation of both metrics follows this entry.
arXiv Detail & Related papers (2023-03-09T13:04:14Z)
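Both metrics reduce to simple ratios over a revision's sentences and references; a toy computation with illustrative counts might look like this:

```python
# Reference need (RN) and reference risk (RR) as defined above;
# the counts below are made up for illustration.
def reference_need(sentences_missing_citation: int, total_sentences: int) -> float:
    """RN: percentage of sentences that lack a citation."""
    return 100.0 * sentences_missing_citation / total_sentences

def reference_risk(non_authoritative_refs: int, total_refs: int) -> float:
    """RR: proportion of references from non-authoritative sources."""
    return non_authoritative_refs / total_refs

# Example: 120 of 800 sentences uncited; 15 of 240 references flagged.
print(f"RN = {reference_need(120, 800):.1f}%")  # RN = 15.0%
print(f"RR = {reference_risk(15, 240):.3f}")    # RR = 0.062
```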