Related papers: Assessing Web Search Credibility and Response Groundedness in Chat Assistants

URL: http://arxiv.org/abs/2510.13749v1
Date: Wed, 15 Oct 2025 16:55:47 GMT
Title: Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Authors: Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko
Abstract summary: We introduce a novel methodology for evaluating assistants' web search behavior. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat.
Score: 4.0127354590894955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants' web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credibility sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.

Related papers

SourceBench: Can AI Answers Reference Quality Web Sources? [14.668125843739423]
SourceBench is a benchmark for measuring the quality of cited web sources across 100 real-world queries. We evaluate eight large language models (LLMs), Google Search, and three AI search tools over 3996 cited sources using SourceBench.
arXiv Detail & Related papers (2026-02-18T23:15:32Z)
Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond [3.615835506868351]
We focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty.
arXiv Detail & Related papers (2026-01-29T14:16:51Z)
OpenNovelty: An LLM-powered Agentic System for Verifiable Scholarly Novelty Assessment [63.662126457336534]
OpenNovelty is an agentic system for transparent, evidence-based novelty analysis. It grounds all assessments in retrieved real papers, ensuring verifiable judgments. OpenNovelty aims to empower the research community with a scalable tool that promotes fair, consistent, and evidence-backed peer review.
arXiv Detail & Related papers (2026-01-04T15:48:51Z)
PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading [24.52586571116556]
Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks. We find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of relevant literature.
arXiv Detail & Related papers (2025-10-25T10:11:29Z)
LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild [86.6586720134927]
LiveResearchBench is a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia. DeepEval is a comprehensive suite covering both content- and report-level quality. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
arXiv Detail & Related papers (2025-10-16T02:49:16Z)
VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification [107.75781898355562]
We introduce a novel framework, called VeriCite, designed to rigorously validate supporting evidence and enhance answer attribution. We conduct experiments across five open-source LLMs and four datasets, demonstrating that VeriCite can significantly improve citation quality while maintaining the correctness of the answers.
arXiv Detail & Related papers (2025-10-13T13:38:54Z)
DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence [50.97612134791782]
Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations.
arXiv Detail & Related papers (2025-09-02T00:32:38Z)
Generative AI Search Engines as Arbiters of Public Knowledge: An Audit of Bias and Authority [2.860575804107195]
This paper reports on an audit study of generative AI systems (ChatGPT, Bing Chat, and Perplexity) which investigates how these new search engines construct responses. We collected system responses using a set of 48 authentic queries for 4 topics over a 7-day period and analyzed the data using sentiment analysis, inductive coding and source classification. Results provide an overview of the nature of system responses across these systems and provide evidence of sentiment bias based on the queries and topics, and commercial and geographic bias in sources.
arXiv Detail & Related papers (2024-05-22T22:09:32Z)
Evaluating Verifiability in Generative Search Engines [70.59477647085387]
Generative search engines directly generate responses to user queries, along with in-line citations. We conduct human evaluation to audit four popular generative search engines. We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations.
arXiv Detail & Related papers (2023-04-19T17:56:12Z)
chatClimate: Grounding Conversational AI in Climate Science [9.043032065867536]
Large Language Models (LLMs) still face two major challenges: hallucination and outdated information after the training phase. We present our conversational AI prototype, available at www.chatclimate.ai, and demonstrate its ability to answer challenging questions accurately. The answers and their sources were evaluated by our team of IPCC authors, who used their expert knowledge to score the accuracy of the answers from 1 (very-low) to 5 (very-high).
arXiv Detail & Related papers (2023-04-11T21:31:39Z)
To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection. We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains. Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.