An Index-based Approach for Efficient and Effective Web Content Extraction
- URL: http://arxiv.org/abs/2512.06641v1
- Date: Sun, 07 Dec 2025 03:18:19 GMT
- Title: An Index-based Approach for Efficient and Effective Web Content Extraction
- Authors: Yihan Chen, Benfeng Xu, Xiaorui Wang, Zhendong Mao
- Abstract summary: We introduce Index-based Web Content Extraction. We partition HTML into structure-aware, addressable segments and extract only the positional indices of content relevant to a given query. This method decouples extraction latency from content length, enabling rapid, query-relevant extraction.
- Score: 38.40209116782093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management -- under large token budgets and low signal density -- emerges as a foundational, high-importance, and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction, which reframes extraction from slow, token-by-token generation into a highly efficient, discriminative task of index prediction, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments and extract only the positional indices of content relevant to a given query. This method decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within a RAG QA system and find that it improves QA accuracy. We then directly measure its match rate with the target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing works in both accuracy and speed, effectively bridging the gap between LLMs and the vast space of web pages.
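The core idea in the abstract (partition HTML into addressable segments, then have a model predict only the indices of relevant segments) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the segmentation rule (one segment per block-level element), the tag set, and the hard-coded predicted indices are all assumptions standing in for the paper's structure-aware segmenter and discriminative index-prediction model.

```python
from html.parser import HTMLParser

class SegmentIndexer(HTMLParser):
    """Partition HTML into addressable text segments (one per block element)."""
    BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "td", "pre"}

    def __init__(self):
        super().__init__()
        self.segments = []   # index -> segment text
        self._buf = []
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._depth:
            self._depth -= 1
            if self._depth == 0 and self._buf:
                self.segments.append(" ".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        if self._depth and data.strip():
            self._buf.append(data.strip())

def extract_by_indices(html, indices):
    """Return only the segments whose indices a model predicted as relevant.

    Cost is dominated by the single parsing pass, independent of how much
    text the selected segments contain -- the latency-decoupling property
    the abstract describes.
    """
    parser = SegmentIndexer()
    parser.feed(html)
    return [parser.segments[i] for i in indices if i < len(parser.segments)]

html = ("<html><body><p>Nav links</p>"
        "<p>Main finding: X improves Y.</p>"
        "<p>Footer</p></body></html>")
# Suppose a discriminative model predicted segment index 1 as query-relevant:
print(extract_by_indices(html, [1]))  # -> ['Main finding: X improves Y.']
```

The point of the sketch is the interface: the expensive generative step (re-emitting content token by token) is replaced by selecting a short list of integers over a precomputed segment table.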
Related papers
- Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining [78.36592534300839]
We show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance. This suggests a simple intervention: by taking a union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71%.
arXiv Detail & Related papers (2026-02-23T06:41:57Z) - SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding [55.45331924836242]
We present URaG, a framework that unifies Retrieval and Generation within a single MLLM. We show that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.
arXiv Detail & Related papers (2025-11-13T17:54:09Z) - FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents [76.12500510390439]
Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations.
arXiv Detail & Related papers (2025-10-03T17:41:30Z) - Careful Queries, Credible Results: Teaching RAG Models Advanced Web Search Tools with Reinforcement Learning [48.46951981642895]
We propose WebFilter, a novel RAG framework that generates source-restricted queries and filters out unreliable content. We show that WebFilter improves answer quality and retrieval precision, outperforming existing RAG methods on both in-domain and out-of-domain benchmarks.
arXiv Detail & Related papers (2025-08-11T13:08:37Z) - QExplorer: Large Language Model Based Query Extraction for Toxic Content Exploration [13.481570152219502]
This study proposes QExplorer, a large language model based query extraction approach for toxic content exploration. The offline empirical results demonstrate that the performance of our automatic query extraction outperforms that of several LLMs and humans. The online deployment shows a significant increase in the detection of toxic items.
arXiv Detail & Related papers (2025-02-06T06:11:58Z) - QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory [75.81394991657545]
We introduce information bottleneck theory (IB) to model the problem. We propose a cross-attention-based approach to approximate mutual information in IB. Our method achieves a 25% increase in compression rate compared to the state-of-the-art.
arXiv Detail & Related papers (2024-08-20T02:44:45Z) - EWEK-QA: Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-based Question Answering Systems [103.91826112815384]
Citation-based QA systems suffer from two shortcomings.
They usually rely only on the web as a source of extracted knowledge, and adding other external knowledge sources can hamper the efficiency of the system.
We propose our enhanced web and efficient knowledge graph (KG) retrieval solution (EWEK-QA) to enrich the content of the extracted knowledge fed to the system.
arXiv Detail & Related papers (2024-06-14T19:40:38Z) - TSTEM: A Cognitive Platform for Collecting Cyber Threat Intelligence in the Wild [0.06597195879147556]
The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy.
Previous research has focused on improving individual components of the extraction process.
The community lacks open-source platforms for deploying streaming CTI data pipelines in the wild.
arXiv Detail & Related papers (2024-02-15T14:29:21Z) - Effective and Efficient Query-aware Snippet Extraction for Web Search [61.60405035952961]
We propose an effective query-aware webpage snippet extraction method named DeepQSE.
DeepQSE first learns query-aware sentence representations for each sentence to capture the fine-grained relevance between query and sentence.
We propose an efficient version of DeepQSE, named Efficient-DeepQSE, which can significantly improve the inference speed of DeepQSE without affecting its performance.
arXiv Detail & Related papers (2022-10-17T07:46:17Z) - Knowledge-guided Open Attribute Value Extraction with Reinforcement
Learning [23.125544502927482]
We propose a knowledge-guided reinforcement learning (RL) framework for open attribute value extraction.
We trained a deep Q-network to sequentially compare extracted answers to improve extraction accuracy.
Our results show that our method outperforms the baselines by 16.5-27.8%.
arXiv Detail & Related papers (2020-10-19T03:28:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.