Chunk Twice, Embed Once: A Systematic Study of Segmentation and Representation Trade-offs in Chemistry-Aware Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2506.17277v1
- Date: Fri, 13 Jun 2025 07:44:53 GMT
- Title: Chunk Twice, Embed Once: A Systematic Study of Segmentation and Representation Trade-offs in Chemistry-Aware Retrieval-Augmented Generation
- Authors: Mahmoud Amiri, Thomas Bocklitz
- Abstract summary: Retrieval-Augmented Generation systems are increasingly vital for navigating the ever-expanding body of scientific literature. This study presents the first large-scale, systematic evaluation of chunking strategies and embedding models tailored to chemistry-focused RAG systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly vital for navigating the ever-expanding body of scientific literature, particularly in high-stakes domains such as chemistry. Despite the promise of RAG, foundational design choices -- such as how documents are segmented and represented -- remain underexplored in domain-specific contexts. This study presents the first large-scale, systematic evaluation of chunking strategies and embedding models tailored to chemistry-focused RAG systems. We investigate 25 chunking configurations across five method families and evaluate 48 embedding models on three chemistry-specific benchmarks, including the newly introduced QuestChemRetrieval dataset. Our results reveal that recursive token-based chunking (specifically R100-0) consistently outperforms other approaches, offering strong performance with minimal resource overhead. We also find that retrieval-optimized embeddings -- such as Nomic and Intfloat E5 variants -- substantially outperform domain-specialized models like SciBERT. By releasing our datasets, evaluation framework, and empirical benchmarks, we provide actionable guidelines for building effective and efficient chemistry-aware RAG systems.
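To make the headline recipe concrete, here is a minimal sketch of an R100-0-style pipeline, reading that label as roughly 100-token recursive chunks with zero overlap paired with a retrieval-optimized embedder. Everything here is an assumption layered on the abstract, not the paper's implementation: the splitter is a hand-rolled approximation that counts whitespace tokens, the specific model name (intfloat/e5-base-v2, one of the Intfloat E5 variants the abstract mentions) and its "query:"/"passage:" prefixes follow that model family's documented usage, and the corpus and query strings are invented for illustration.

```python
# Sketch only: approximates "recursive token-based chunking, 100 tokens,
# zero overlap" (our reading of R100-0) plus a retrieval-optimized embedder.
import numpy as np
from sentence_transformers import SentenceTransformer

SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarse-to-fine split points

def recursive_chunk(text, max_tokens=100, seps=SEPARATORS):
    """Recursively split text until every chunk fits within max_tokens
    (whitespace tokens here, as a stand-in for a real tokenizer)."""
    if len(text.split()) <= max_tokens or not seps:
        return [text.strip()] if text.strip() else []
    head, *rest = seps
    chunks, buffer = [], ""
    for part in text.split(head):
        candidate = buffer + head + part if buffer else part
        if len(candidate.split()) <= max_tokens:
            buffer = candidate          # keep packing the current chunk
        else:
            chunks.extend(recursive_chunk(buffer, max_tokens, rest))
            buffer = part
    chunks.extend(recursive_chunk(buffer, max_tokens, rest))
    return chunks

# Retrieval-optimized embeddings; E5 models expect "query:"/"passage:" prefixes.
model = SentenceTransformer("intfloat/e5-base-v2")  # assumed variant

corpus = (  # invented stand-in for a chemistry paper's full text
    "Retrieval-augmented generation grounds a language model's answers in "
    "retrieved passages. In chemistry, chunk granularity and the choice of "
    "embedding model both change what the retriever can surface."
)
passages = recursive_chunk(corpus)
doc_vecs = model.encode([f"passage: {p}" for p in passages],
                        normalize_embeddings=True)

query = "How does chunk size affect chemistry retrieval?"
q_vec = model.encode([f"query: {query}"], normalize_embeddings=True)

scores = (doc_vecs @ q_vec.T).ravel()   # cosine similarity (vectors are unit-norm)
for i in np.argsort(-scores)[:3]:       # top-3 chunks by similarity
    print(f"{scores[i]:.3f}  {passages[i][:70]}")
```

One design note: zero overlap keeps the index smaller and avoids near-duplicate vectors, which is consistent with the abstract's claim that R100-0 offers strong performance with minimal resource overhead.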
Related papers
- Benchmarking Retrieval-Augmented Generation for Chemistry [28.592844362931853]
Retrieval-augmented generation is a framework for enhancing large language models with external knowledge.
ChemRAG-Bench is a benchmark designed to assess the effectiveness of RAG across a diverse set of chemistry-related tasks.
ChemRAG-Toolkit is a modular toolkit that supports five retrieval algorithms and eight LLMs.
arXiv Detail & Related papers (2025-05-12T15:34:45Z)
- Replication and Exploration of Generative Retrieval over Dynamic Corpora [87.09185685594105]
Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR).
We show that existing GR models with text-based docids show superior generalization to unseen documents.
We propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids.
arXiv Detail & Related papers (2025-04-24T13:01:23Z)
- XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation [36.84847781022757]
Retrieval-augmented generation (RAG) synergizes the retrieval of pertinent data with the generative capabilities of Large Language Models (LLMs).
We introduce XRAG, an open-source, modular codebase that facilitates exhaustive evaluation of the performance of foundational components of advanced RAG modules.
arXiv Detail & Related papers (2024-12-20T03:37:07Z)
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.
Our benchmark is characterized by its multi-dimensional evaluation framework.
Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
- ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain [0.8974531206817746]
This paper introduces a novel benchmark, the Chemical Text Embedding Benchmark (ChemTEB).
ChemTEB addresses the unique linguistic and semantic complexities of chemical literature and data.
We illuminate the strengths and weaknesses of current methodologies in processing and understanding chemical information.
arXiv Detail & Related papers (2024-11-30T16:45:31Z)
- WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking [13.880278087741482]
While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation.
We seek to establish a new gold standard for small molecule drug discovery benchmarking, WelQrate.
arXiv Detail & Related papers (2024-11-14T21:49:41Z)
- Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models [0.0]
This paper introduces a novel approach to enhancing closed-domain Question Answering (QA) systems.
It focuses on the specific needs of the Lawrence Berkeley National Laboratory (LBL) Science Information Technology (ScienceIT) domain.
arXiv Detail & Related papers (2024-10-24T00:49:46Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)