Related papers: Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs

Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs

URL: http://arxiv.org/abs/2511.18931v1
Date: Mon, 24 Nov 2025 09:37:43 GMT
Title: Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs
Authors: Sahil Kale,
Abstract summary: We introduce a benchmark evaluating the necessity and effectiveness of web access across commercial models with no access to internal states or parameters.<n>The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge.<n>Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a good low-latency verification layer than a reliable analytical tool, with clear room for improvement.

Related papers

Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search [54.987957691350665]
Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query.<n>Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications.<n>We propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search.
arXiv Detail & Related papers (2025-08-28T08:51:51Z)
RAVine: Reality-Aligned Evaluation for Agentic Search [7.4420114967110385]
RAVine is a Reality-Aligned eValuation framework for agentic LLMs with search.<n> RAVine targets multi-point queries and long-form answers that better reflect user intents.<n>We benchmark a series of models using RAVine and derive several insights.
arXiv Detail & Related papers (2025-07-22T16:08:12Z)
Maximally-Informative Retrieval for State Space Model Generation [59.954191072042526]
We introduce Retrieval In-Context Optimization (RICO) to minimize model uncertainty for a particular query at test-time.<n>Unlike traditional retrieval-augmented generation (RAG), which relies on externals for document retrieval, our approach leverages direct feedback from the model.<n>We show that standard top-$k$ retrieval with model gradients can approximate our optimization procedure, and provide connections to the leave-one-out loss.
arXiv Detail & Related papers (2025-06-13T18:08:54Z)
DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning [73.68685269970844]
We introduce WebPuzzle, a training and 275-sample test benchmark that evaluates information seeking on the live internet.<n>We develop DeepDiver, a reinforcement-learning framework that cultivates Search Intensity Scaling (SIS)-an emergent ability to escalate search frequency and depth.<n>We detail DeepDiver's curriculum from cold-start SFT to a well designed RL procedure, and show that its seeking policy generalized from closed-ended queries to open-ended generation such as long-form writing.
arXiv Detail & Related papers (2025-05-30T08:15:39Z)
SEM: Reinforcement Learning for Search-Efficient Large Language Models [26.075903427834838]
Large Language Models (LLMs) have demonstrated their capabilities not only in reasoning but also in invoking external tools.<n>Existing reinforcement learning approaches often lead to redundant search behaviors, resulting in inefficiencies and over-cost.<n>We propose SEM, a novel post-training reinforcement learning framework that explicitly trains LLMs to optimize search usage.
arXiv Detail & Related papers (2025-05-12T09:45:40Z)
Improving Retrieval in Sponsored Search by Leveraging Query Context Signals [6.152499434499752]
We propose an approach to enhance query understanding by augmenting queries with rich contextual signals. We use web search titles and snippets to ground queries in real-world information and utilize GPT-4 to generate query rewrites and explanations. Our context-aware approach substantially outperforms context-free models.
arXiv Detail & Related papers (2024-07-19T14:28:53Z)
Query-oriented Data Augmentation for Session Search [71.84678750612754]
We propose query-oriented data augmentation to enrich search logs and empower the modeling. We generate supplemental training pairs by altering the most important part of a search context. We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty.
arXiv Detail & Related papers (2024-07-04T08:08:33Z)
Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
DSI++: Updating Transformer Memory with New Documents [95.70264288158766]
We introduce DSI++, a continual learning challenge for DSI to incrementally index new documents. We show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task.
arXiv Detail & Related papers (2022-12-19T18:59:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.