ChunkNorris: A High-Performance and Low-Energy Approach to PDF Parsing and Chunking
- URL: http://arxiv.org/abs/2602.00010v1
- Date: Fri, 28 Nov 2025 11:02:19 GMT
- Title: ChunkNorris: A High-Performance and Low-Energy Approach to PDF Parsing and Chunking
- Authors: Mathieu Ciancone, Clovis Varangot-Reille, Marion Schaeffer,
- Abstract summary: ChunkNorris is a novel-based technique designed to optimise the parsing and chunking of PDF documents.<n>We demonstrate efficiency of ChunkNorris through a benchmark against existing parsing and chunking methods.<n>This research highlights the potential of methods for real-world, resource-constrained RAG use cases.
- Score: 0.3811219334668792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Retrieval-Augmented Generation applications, the Information Retrieval part is central as it provides the contextual information that enables a Large Language Model to generate an appropriate and truthful response. High quality parsing and chunking are critical as efficient data segmentation directly impacts downstream tasks, i.e. Information Retrieval and answer generation. In this paper, we introduce ChunkNorris, a novel heuristic-based technique designed to optimise the parsing and chunking of PDF documents. Our approach does not rely on machine learning and employs a suite of simple yet effective heuristics to achieve high performance with minimal computational overhead. We demonstrate the efficiency of ChunkNorris through a comprehensive benchmark against existing parsing and chunking methods, evaluating criteria such as execution time, energy consumption, and retrieval accuracy. We propose an open-access dataset to produce our results. ChunkNorris outperforms baseline and more advanced techniques, offering a practical and efficient alternative for Information Retrieval tasks. Therefore, this research highlights the potential of heuristic-based methods for real-world, resource-constrained RAG use cases.
Related papers
- Easy Data Unlearning Bench [53.1304932656586]
We introduce a unified and benchmarking suite that simplifies the evaluation of unlearning algorithms.<n>By standardizing setup and metrics, it enables reproducible, scalable, and fair comparison across unlearning methods.
arXiv Detail & Related papers (2026-02-18T12:20:32Z) - SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization [64.95852289011385]
Large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved.<n> evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs.<n>We propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection.
arXiv Detail & Related papers (2026-02-08T11:12:45Z) - SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG [41.16937860730275]
We present SmartChunk, a query-adaptive framework for efficient and robust long-document question answering (QA)<n>SmartChunk uses a planner that predicts the optimal chunk abstraction level for each query, and a lightweight compression module that produces high-level chunk embeddings without repeated summarization.<n>To reflect real-world applications, where users face diverse document types and query styles, we evaluate SmartChunk on five QA benchmarks plus one out-of-domain dataset.
arXiv Detail & Related papers (2025-12-17T01:21:44Z) - Tree-Based Text Retrieval via Hierarchical Clustering in RAGFrameworks: Application on Taiwanese Regulations [0.0]
We propose a hierarchical clustering-based retrieval method that eliminates the need to predefine k.<n>Our approach maintains the accuracy and relevance of system responses while adaptively selecting semantically relevant content.<n>Our framework is simple to implement and easily integrates with existing RAG pipelines, making it a practical solution for real-world applications under limited resources.
arXiv Detail & Related papers (2025-06-16T15:34:29Z) - Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation [80.69067017594709]
Large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks.<n>We propose a novel framework, log-augmented generation (LAG) that directly reuses prior computation and reasoning from past logs at test time.<n>Our method significantly outperforms standard agentic systems that do not utilize logs.
arXiv Detail & Related papers (2025-05-20T14:14:38Z) - Effective Inference-Free Retrieval for Learned Sparse Representations [19.54810957623511]
Learned Sparse Retrieval (LSR) is an effective IR approach that exploits pre-trained language models for encoding text into a learned bag of words.<n>Recently, new efficient -- inverted index-based -- retrieval engines have been proposed, leading to a natural question: has the role of regularization changed in training LSR models?<n>We show that regularization can be relaxed to produce more effective LSR encoders.
arXiv Detail & Related papers (2025-04-30T09:10:46Z) - Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers.<n>We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.<n>Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z) - Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval [49.669503570350166]
Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task.<n>Existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively.<n>We propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking.
arXiv Detail & Related papers (2025-04-07T15:27:37Z) - Learning Task Representations from In-Context Learning [67.66042137487287]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning (ICL)<n>We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads.<n>The proposed method successfully extracts task-specific information from in-context demonstrations and excels in both text and regression tasks.
arXiv Detail & Related papers (2025-02-08T00:16:44Z) - Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that the pointwise mutual information between a context and a question is an effective gauge for language model performance.<n>We propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z) - Beyond Adapter Retrieval: Latent Geometry-Preserving Composition via Sparse Task Projection [22.748835458594744]
We propose a new framework for adapter reuse that moves beyond retrieval.<n>We represent each task by a latent prototype vector and aim to approximate the target task prototype as a sparse linear combination of retrieved reference prototypes.<n>The resulting combination weights are then used to blend the corresponding LoRA adapters, yielding a composite adapter tailored to the target task.
arXiv Detail & Related papers (2024-10-13T16:28:38Z) - Retrieval-Oriented Knowledge for Click-Through Rate Prediction [29.55757862617378]
Click-through rate (CTR) prediction is crucial for personalized online services.
underlineretrieval-underlineoriented underlineknowledge (bfname) framework bypasses the real retrieval process.
name features a knowledge base that preserves and imitates the retrieved & aggregated representations.
arXiv Detail & Related papers (2024-04-28T20:21:03Z) - Repoformer: Selective Retrieval for Repository-Level Code Completion [30.706277772743615]
Recent advances in retrieval-augmented generation (RAG) have initiated a new era in repository-level code completion.
In this paper, we propose a selective RAG framework to avoid retrieval when unnecessary.
We show that our framework is able to accommodate different generation models, retrievers, and programming languages.
arXiv Detail & Related papers (2024-03-15T06:59:43Z) - Integrating Semantics and Neighborhood Information with Graph-Driven
Generative Models for Document Retrieval [51.823187647843945]
In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model.
Under the approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones.
arXiv Detail & Related papers (2021-05-27T11:29:03Z) - IOHanalyzer: Detailed Performance Analyses for Iterative Optimization
Heuristics [3.967483941966979]
IOHanalyzer is a new user-friendly tool for the analysis, comparison, and visualization of performance data of IOHs.
IOHanalyzer provides detailed statistics about fixed-target running times and about fixed-budget performance of the benchmarked algorithms.
IOHanalyzer can directly process performance data from the main benchmarking platforms.
arXiv Detail & Related papers (2020-07-08T08:20:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.