SourceBench: Can AI Answers Reference Quality Web Sources?
- URL: http://arxiv.org/abs/2602.16942v1
- Date: Wed, 18 Feb 2026 23:15:32 GMT
- Title: SourceBench: Can AI Answers Reference Quality Web Sources?
- Authors: Hexi Jin, Stephen Liu, Yuheng Li, Simran Malik, Yiying Zhang
- Abstract summary: SourceBench is a benchmark for measuring the quality of cited web sources across 100 real-world queries. We evaluate eight large language models (LLMs), Google Search, and three AI search tools over 3996 cited sources using SourceBench.
- Score: 14.668125843739423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that closely matches expert judgments. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key insights that can guide future research at the intersection of GenAI and web search.
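The eight-metric framework described above (three content-quality metrics plus five page-level signals) could be organized per cited source as follows. This is a hypothetical sketch: the abstract names only three of the five page-level signals, so the last two field names are placeholders, and the 0-1 scale and unweighted aggregation are assumptions, not the paper's method.

```python
from dataclasses import dataclass, fields
from statistics import mean

@dataclass
class SourceScores:
    """Hypothetical per-source scores on an assumed 0-1 scale
    for the eight SourceBench metrics."""
    # Content-quality metrics (named in the abstract)
    content_relevance: float
    factual_accuracy: float
    objectivity: float
    # Page-level signals (three named in the abstract; the
    # abstract's "e.g." implies more, so two are placeholders)
    freshness: float
    authority_accountability: float
    clarity: float
    page_signal_7: float
    page_signal_8: float

    def content_quality(self) -> float:
        """Mean of the three content-quality metrics (aggregation assumed)."""
        return mean([self.content_relevance, self.factual_accuracy,
                     self.objectivity])

    def overall(self) -> float:
        """Unweighted mean over all eight metrics (aggregation assumed)."""
        return mean(getattr(self, f.name) for f in fields(self))
```

For example, a source scoring 1.0 on all content metrics and 0.5 on all page-level signals would receive an overall score of 0.6875 under this assumed scheme.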
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z)
- Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning [49.559151128219725]
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness. We propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets.
arXiv Detail & Related papers (2025-11-13T08:13:23Z)
- OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking [47.579237867766686]
OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models.
arXiv Detail & Related papers (2025-10-31T16:44:34Z)
- Assessing Web Search Credibility and Response Groundedness in Chat Assistants [4.0127354590894955]
We introduce a novel methodology for evaluating assistants' web search behavior. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat.
arXiv Detail & Related papers (2025-10-15T16:55:47Z)
- ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks [14.371010711040304]
ReportBench is a benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports.
arXiv Detail & Related papers (2025-08-14T03:33:43Z)
- A Knowledge Plug-and-Play Test Bed for Open-domain Dialogue Generation [51.31429493814664]
We present a benchmark named multi-source Wizard of Wikipedia for evaluating multi-source dialogue knowledge selection and response generation.
We propose a new challenge, dialogue knowledge plug-and-play, which aims to test an already trained dialogue model on using new support knowledge from previously unseen sources.
arXiv Detail & Related papers (2024-03-06T06:54:02Z)
- WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations [34.99831757956635]
We formulate the task of attributed query-focused summarization (AQFS) and present WebCiteS, a Chinese dataset featuring 7k human-annotated summaries with citations.
We tackle these issues by developing detailed metrics and enabling the automatic evaluator to decompose the sentences into sub-claims for fine-grained verification.
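The sub-claim decomposition described above can be illustrated with a minimal loop. This sketch is purely illustrative and not WebCiteS's method: the dataset uses an automatic (LLM-based) evaluator, whereas here a naive conjunction splitter stands in for claim decomposition and substring matching stands in for entailment checking; all function names are assumptions.

```python
import re

def decompose(sentence: str) -> list[str]:
    """Naive stand-in for LLM-based sub-claim decomposition:
    split on semicolons and coordinating 'and'."""
    parts = re.split(r";|\band\b", sentence)
    return [p.strip() for p in parts if p.strip()]

def verify(summary_sentence: str, cited_snippets: list[str]) -> dict[str, bool]:
    """Check each sub-claim against the cited snippets.
    Case-insensitive substring matching stands in for a real
    entailment or fact-verification model."""
    return {
        claim: any(claim.lower() in snippet.lower() for snippet in cited_snippets)
        for claim in decompose(summary_sentence)
    }
```

A fine-grained verifier in this style can flag exactly which part of a cited sentence lacks support, rather than scoring the sentence as a whole.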
arXiv Detail & Related papers (2024-03-04T07:06:41Z)
- AlignBench: Benchmarking Chinese Alignment of Large Language Models [99.24597941555277]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario rooted queries and corresponding human verified references.
For automatic evaluation, our benchmark employs a rule-calibrated, multi-dimensional LLM-as-Judge (Zheng et al., 2023) approach with Chain-of-Thought to generate explanations and final ratings.
arXiv Detail & Related papers (2023-11-30T17:41:30Z)
- Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher [10.053004550486214]
Large Language Models (LLMs) have shown the potential to improve relevance and provide direct answers in web searches.
However, challenges arise in the reliability of generated results and the credibility of contributing sources.
We propose a novel generative retrieval framework leveraging the knowledge of LLMs to foster a direct link between queries and online sources.
arXiv Detail & Related papers (2023-10-19T03:49:36Z)
- Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution [48.86322922826514]
This paper defines a new task of Knowledge-aware Language Model Attribution (KaLMA).
First, we extend attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios.
Second, we propose a new "Conscious Incompetence" setting that accounts for an incomplete knowledge repository.
Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment.
arXiv Detail & Related papers (2023-10-09T11:45:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.