Evaluating Verifiability in Generative Search Engines
- URL: http://arxiv.org/abs/2304.09848v2
- Date: Mon, 23 Oct 2023 19:11:38 GMT
- Title: Evaluating Verifiability in Generative Search Engines
- Authors: Nelson F. Liu and Tianyi Zhang and Percy Liang
- Abstract summary: Generative search engines directly generate responses to user queries, along with in-line citations.
We conduct human evaluation to audit four popular generative search engines.
We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations.
- Score: 70.59477647085387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative search engines directly generate responses to user queries, along
with in-line citations. A prerequisite trait of a trustworthy generative search
engine is verifiability, i.e., systems should cite comprehensively (high
citation recall; all statements are fully supported by citations) and
accurately (high citation precision; every cite supports its associated
statement). We conduct human evaluation to audit four popular generative search
engines -- Bing Chat, NeevaAI, perplexity.ai, and YouChat -- across a diverse
set of queries from a variety of sources (e.g., historical Google user queries,
dynamically-collected open-ended questions on Reddit, etc.). We find that
responses from existing generative search engines are fluent and appear
informative, but frequently contain unsupported statements and inaccurate
citations: on average, a mere 51.5% of generated sentences are fully supported
by citations and only 74.5% of citations support their associated sentence. We
believe that these results are concerningly low for systems that may serve as a
primary tool for information-seeking users, especially given their facade of
trustworthiness. We hope that our results further motivate the development of
trustworthy generative search engines and help researchers and users better
understand the shortcomings of existing commercial systems.
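The citation recall and precision definitions above can be sketched in code. This is an illustrative computation only; the `Sentence` structure and its per-citation judgment fields are assumptions for the sketch, not the paper's actual annotation schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sentence:
    """A generated sentence with (hypothetical) human judgments of its citations."""
    # One boolean per attached citation: does that citation support the sentence?
    citation_supports: List[bool] = field(default_factory=list)
    # Is the sentence as a whole fully supported by its citations?
    fully_supported: bool = False

def citation_recall(sentences: List[Sentence]) -> float:
    """Fraction of generated sentences fully supported by their citations."""
    if not sentences:
        return 0.0
    return sum(s.fully_supported for s in sentences) / len(sentences)

def citation_precision(sentences: List[Sentence]) -> float:
    """Fraction of citations that support their associated sentence."""
    judgments = [j for s in sentences for j in s.citation_supports]
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)

# Toy response: 2 of 4 sentences fully supported (recall 0.5);
# 3 of 4 citations support their sentence (precision 0.75).
response = [
    Sentence(citation_supports=[True], fully_supported=True),
    Sentence(citation_supports=[True, False], fully_supported=False),
    Sentence(citation_supports=[True], fully_supported=True),
    Sentence(citation_supports=[], fully_supported=False),
]
print(citation_recall(response))     # 0.5
print(citation_precision(response))  # 0.75
```

Under these definitions, the paper's averages (51.5% recall, 74.5% precision) mean roughly half of generated sentences lack full citation support, and one in four citations fails to support its sentence.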
Related papers
- Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses [32.49468716515915]
Large Language Model (LLM)-based applications are graduating from research prototypes to products serving millions of users.
A prominent example is the appearance of Answer Engines: LLM-based generative search engines supplanting traditional search engines.
arXiv Detail & Related papers (2024-10-15T00:50:31Z)
- WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations [34.99831757956635]
We formulate the task of attributed query-focused summarization (AQFS) and present WebCiteS, a Chinese dataset featuring 7k human-annotated summaries with citations.
We tackle these issues by developing detailed metrics and enabling the automatic evaluator to decompose the sentences into sub-claims for fine-grained verification.
arXiv Detail & Related papers (2024-03-04T07:06:41Z)
- Evaluating Robustness of Generative Search Engine on Adversarial Factual Questions [89.35345649303451]
Generative search engines have the potential to transform how people seek information online.
But responses generated by existing LLM-backed generative search engines may not always be accurate.
Retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system.
arXiv Detail & Related papers (2024-02-25T11:22:19Z)
- Evaluating Generative Ad Hoc Information Retrieval [58.800799175084286]
Generative retrieval systems often directly return a grounded generated text as a response to a query.
Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval.
arXiv Detail & Related papers (2023-11-08T14:05:00Z)
- Social Commonsense-Guided Search Query Generation for Open-Domain Knowledge-Powered Conversations [66.16863141262506]
We present a novel approach that focuses on generating internet search queries guided by social commonsense.
Our proposed framework addresses passive user interactions by integrating topic tracking, commonsense response generation and instruction-driven query generation.
arXiv Detail & Related papers (2023-10-22T16:14:56Z)
- Enabling Large Language Models to Generate Text with Citations [37.64884969997378]
Large language models (LLMs) have emerged as a widely-used tool for information seeking.
Our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability.
We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation.
arXiv Detail & Related papers (2023-05-24T01:53:49Z)
- WebCPM: Interactive Web Search for Chinese Long-form Question Answering [104.676752359777]
Long-form question answering (LFQA) aims at answering complex, open-ended questions with detailed, paragraph-length responses.
We introduce WebCPM, the first Chinese LFQA dataset.
We collect 5,500 high-quality question-answer pairs, together with 14,315 supporting facts and 121,330 web search actions.
arXiv Detail & Related papers (2023-05-11T14:47:29Z)
- Search-Engine-augmented Dialogue Response Generation with Cheaply Supervised Query Production [98.98161995555485]
We propose a dialogue model that can access the vast and dynamic information from any search engine for response generation.
As the core module, a query producer is used to generate queries from a dialogue context to interact with a search engine.
Experiments show that our query producer can achieve R@1 and R@5 rates of 62.4% and 74.8% for retrieving gold knowledge.
arXiv Detail & Related papers (2023-02-16T01:58:10Z)
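The R@1 and R@5 figures reported in the entry above are standard recall-at-k retrieval metrics. A minimal sketch of the computation follows; the document IDs and ranked lists are hypothetical, not data from the paper.

```python
from typing import Sequence

def recall_at_k(ranked_lists: Sequence[Sequence[str]],
                gold: Sequence[str],
                k: int) -> float:
    """R@k: fraction of queries whose gold item appears in the top-k results."""
    hits = sum(g in ranked[:k] for ranked, g in zip(ranked_lists, gold))
    return hits / len(gold)

# Hypothetical retrieval runs for three dialogue turns.
ranked = [
    ["doc_a", "doc_b", "doc_c"],  # gold item at rank 1
    ["doc_x", "doc_g", "doc_y"],  # gold item at rank 2
    ["doc_m", "doc_n", "doc_o"],  # gold item not retrieved
]
gold = ["doc_a", "doc_g", "doc_z"]
print(recall_at_k(ranked, gold, k=1))  # 1/3: only the first query hits at rank 1
print(recall_at_k(ranked, gold, k=5))  # 2/3: the second query's gold item is within the top 5
```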
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.