An Index-based Approach for Efficient and Effective Web Content Extraction
- URL: http://arxiv.org/abs/2512.06641v1
- Date: Sun, 07 Dec 2025 03:18:19 GMT
- Title: An Index-based Approach for Efficient and Effective Web Content Extraction
- Authors: Yihan Chen, Benfeng Xu, Xiaorui Wang, Zhendong Mao
- Abstract summary: We introduce Index-based Web Content Extraction. We partition HTML into structure-aware, addressable segments and extract only the positional indices of content relevant to a given query. This method decouples extraction latency from content length, enabling rapid, query-relevant extraction.
- Score: 38.40209116782093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management -- under large token budgets and low signal density -- emerges as a foundational, high-importance, and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction, which reframes extraction from slow, token-by-token generation into a highly efficient, discriminative task of index prediction, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments and extract only the positional indices of content relevant to a given query. This method decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within a RAG QA system and find that it improves QA accuracy. We then directly measure its match rate with the target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing works in both accuracy and speed, effectively bridging the gap between LLMs and the vast space of web pages.
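The core idea in the abstract (partition HTML into addressable segments, then have a model predict only the indices of relevant segments) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the segmentation rule (one segment per block-level element), the tag set, and the hard-coded predicted indices are all assumptions standing in for the paper's structure-aware segmenter and discriminative index-prediction model.

```python
from html.parser import HTMLParser

class SegmentIndexer(HTMLParser):
    """Partition HTML into addressable text segments (one per block element)."""
    BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "td", "pre"}

    def __init__(self):
        super().__init__()
        self.segments = []   # index -> segment text
        self._buf = []
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._depth:
            self._depth -= 1
            if self._depth == 0 and self._buf:
                self.segments.append(" ".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        if self._depth and data.strip():
            self._buf.append(data.strip())

def extract_by_indices(html, indices):
    """Return only the segments whose indices a model predicted as relevant.

    Cost is dominated by the single parsing pass, independent of how much
    text the selected segments contain -- the latency-decoupling property
    the abstract describes.
    """
    parser = SegmentIndexer()
    parser.feed(html)
    return [parser.segments[i] for i in indices if i < len(parser.segments)]

html = ("<html><body><p>Nav links</p>"
        "<p>Main finding: X improves Y.</p>"
        "<p>Footer</p></body></html>")
# Suppose a discriminative model predicted segment index 1 as query-relevant:
print(extract_by_indices(html, [1]))  # -> ['Main finding: X improves Y.']
```

The point of the sketch is the interface: the expensive generative step (re-emitting content token by token) is replaced by selecting a short list of integers over a precomputed segment table.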
Related papers
- Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining [78.36592534300839]
We show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance. This suggests a simple intervention: by taking a union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71%.
arXiv Detail & Related papers (2026-02-23T06:41:57Z) - SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding [55.45331924836242]
We present URaG, a framework that unifies Retrieval and Generation within a single MLLM. We show that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.
arXiv Detail & Related papers (2025-11-13T17:54:09Z) - FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents [76.12500510390439]
Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations.
arXiv Detail & Related papers (2025-10-03T17:41:30Z) - Careful Queries, Credible Results: Teaching RAG Models Advanced Web Search Tools with Reinforcement Learning [48.46951981642895]
We propose WebFilter, a novel RAG framework that generates source-restricted queries and filters out unreliable content. We show that WebFilter improves answer quality and retrieval precision, outperforming existing RAG methods on both in-domain and out-of-domain benchmarks.
arXiv Detail & Related papers (2025-08-11T13:08:37Z) - QExplorer: Large Language Model Based Query Extraction for Toxic Content Exploration [13.481570152219502]
This study proposes QExplorer, a large language model based query extraction approach for toxic content exploration. The offline empirical results demonstrate that the performance of our automatic query extraction outperforms that of several LLMs and humans. The online deployment shows a significant increase in the detection of toxic items.
arXiv Detail & Related papers (2025-02-06T06:11:58Z) - QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory [75.81394991657545]
We introduce information bottleneck theory (IB) to model the problem. We propose a cross-attention-based approach to approximate mutual information in IB. Our method achieves a 25% increase in compression rate compared to the state-of-the-art.
arXiv Detail & Related papers (2024-08-20T02:44:45Z) - EWEK-QA: Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-based Question Answering Systems [103.91826112815384]
Citation-based QA systems suffer from two shortcomings.
They usually rely only on the web as a source of extracted knowledge, and adding other external knowledge sources can hamper the efficiency of the system.
We propose our enhanced web and efficient knowledge graph (KG) retrieval solution (EWEK-QA) to enrich the content of the extracted knowledge fed to the system.
arXiv Detail & Related papers (2024-06-14T19:40:38Z) - TSTEM: A Cognitive Platform for Collecting Cyber Threat Intelligence in the Wild [0.06597195879147556]
The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy.
Previous research has focused on improving individual components of the extraction process.
The community lacks open-source platforms for deploying streaming CTI data pipelines in the wild.
arXiv Detail & Related papers (2024-02-15T14:29:21Z) - Effective and Efficient Query-aware Snippet Extraction for Web Search [61.60405035952961]
We propose an effective query-aware webpage snippet extraction method named DeepQSE.
DeepQSE first learns query-aware sentence representations for each sentence to capture the fine-grained relevance between query and sentence.
We propose an efficient version of DeepQSE, named Efficient-DeepQSE, which can significantly improve the inference speed of DeepQSE without affecting its performance.
arXiv Detail & Related papers (2022-10-17T07:46:17Z) - Knowledge-guided Open Attribute Value Extraction with Reinforcement
Learning [23.125544502927482]
We propose a knowledge-guided reinforcement learning (RL) framework for open attribute value extraction.
We trained a deep Q-network to sequentially compare extracted answers to improve extraction accuracy.
Our results show that our method outperforms the baselines by 16.5-27.8%.
arXiv Detail & Related papers (2020-10-19T03:28:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.