Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation
- URL: http://arxiv.org/abs/2512.00367v1
- Date: Sat, 29 Nov 2025 07:30:37 GMT
- Title: Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation
- Authors: Aparajitha Allamraju, Maitreya Prafulla Chitale, Hiranmai Sri Adibhatla, Rahul Mishra, Manish Shrivastava
- Abstract summary: This paper introduces two efficient semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC). Our results show substantial retrieval improvements (24x with PSC) in MRR and higher Hits@k on PubMedQA. Despite being trained on a single domain, PSC and MFC also generalize well, achieving strong out-of-domain generation performance across multiple datasets.
- Score: 5.0491491564528515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document chunking is a crucial component of Retrieval-Augmented Generation (RAG), as it directly affects the retrieval of relevant and precise context. Conventional fixed-length and recursive splitters often produce arbitrary, incoherent segments that fail to preserve semantic structure. Although semantic chunking has gained traction, its influence on generation quality remains underexplored. This paper introduces two efficient semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC), trained on PubMed data using three different embedding models. We further present an evaluation framework that measures the effect of chunking on both retrieval and generation by augmenting PubMedQA with full-text PubMed Central articles. Our results show substantial retrieval improvements (24x with PSC) in MRR and higher Hits@k on PubMedQA. We provide a comprehensive analysis, including statistical significance and response-time comparisons with common chunking libraries. Despite being trained on a single domain, PSC and MFC also generalize well, achieving strong out-of-domain generation performance across multiple datasets. Overall, our findings confirm that our semantic chunkers, especially PSC, consistently deliver superior performance.
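The abstract reports retrieval quality as MRR (mean reciprocal rank) and Hits@k. As a minimal sketch of how these two metrics are conventionally computed, and not the authors' evaluation code, the following assumes each query has a ranked list of retrieved chunk IDs and a set of relevant chunk IDs. The function names and the chunk IDs are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant_sets: Sequence[Set[Hashable]],
) -> float:
    """Average the reciprocal rank over all queries."""
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, s) for r, s in zip(rankings, relevant_sets))
    return total / len(rankings)


def hits_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevant_sets: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    if not rankings:
        return 0.0
    hits = sum(
        1
        for r, s in zip(rankings, relevant_sets)
        if any(item in s for item in r[:k])
    )
    return hits / len(rankings)


# Hypothetical retrieval results for three queries (chunk IDs are made up).
example_rankings = [["c3", "c1", "c7"], ["c2", "c9", "c4"], ["c5", "c6", "c8"]]
example_relevant = [{"c1"}, {"c2"}, {"c0"}]

example_mrr = mean_reciprocal_rank(example_rankings, example_relevant)
example_hits_1 = hits_at_k(example_rankings, example_relevant, k=1)
example_hits_3 = hits_at_k(example_rankings, example_relevant, k=3)
```

In this toy data the first relevant chunk appears at rank 2 for the first query, at rank 1 for the second, and not at all for the third, so the reciprocal ranks are 1/2, 1, and 0. A chunker that places the answer-bearing chunk higher in the ranking raises both metrics.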
Related papers
- Multi-Vector Index Compression in Any Modality [73.7330345057813]
Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation.
arXiv Detail & Related papers (2026-02-24T18:57:33Z)
- Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models [64.58262227709842]
ARISE (Attention-weighted Representation with Integrated Semantic Embeddings) is presented. It builds semantic-aware representations that complement the metric space of categorical data for accurate clustering. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts.
arXiv Detail & Related papers (2026-01-03T11:37:46Z)
- A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking [44.47350338664039]
Document chunking fundamentally impacts Retrieval-Augmented Generation (RAG). There is currently no framework to analyze the impact of different chunking methods. We introduce a novel methodology that defines essential characteristics of the chunking process at three levels.
arXiv Detail & Related papers (2025-05-04T16:22:27Z)
- Passage Segmentation of Documents for Extractive Question Answering [0.0]
This study emphasizes the critical role of chunking in improving the performance of both dense passage retrieval and the end-to-end RAG pipeline. We introduce the Logits-Guided Multi-Granular Chunker (LGMGC), a novel framework that splits long documents into contextualized, self-contained chunks of varied granularity.
arXiv Detail & Related papers (2025-01-17T03:42:18Z)
- Attention with Dependency Parsing Augmentation for Fine-Grained Attribution [26.603281615221505]
We develop a fine-grained attribution mechanism that provides supporting evidence from retrieved documents for every answer span. Existing attribution methods rely on model-internal similarity metrics between responses and documents, such as saliency scores and hidden state similarity. We propose two techniques applicable to all model-internals-based methods. First, we aggregate token-wise evidence through set union operations, preserving the granularity of representations. Second, we enhance the attributor by integrating dependency parsing to enrich the semantic completeness of target spans.
arXiv Detail & Related papers (2024-12-16T03:12:13Z)
- SiReRAG: Indexing Similar and Related Information for Multihop Reasoning [96.60045548116584]
SiReRAG is a novel RAG indexing approach that explicitly considers both similar and related information. SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets.
arXiv Detail & Related papers (2024-12-09T04:56:43Z)
- Every Component Counts: Rethinking the Measure of Success for Medical Semantic Segmentation in Multi-Instance Segmentation Tasks [60.80828925396154]
We present Connected-Component(CC)-Metrics, a novel semantic segmentation evaluation protocol.
We motivate this setup in the common medical scenario of semantic segmentation in a full-body PET/CT.
We show how existing semantic segmentation metrics suffer from a bias towards larger connected components.
arXiv Detail & Related papers (2024-10-24T12:26:05Z)
- Is Semantic Chunking Worth the Computational Cost? [0.0]
This study systematically evaluates the effectiveness of semantic chunking using three common retrieval-related tasks.
The results show that the computational costs associated with semantic chunking are not justified by consistent performance gains.
arXiv Detail & Related papers (2024-10-16T21:53:48Z)
- Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception [10.614437503578856]
This paper proposes the Meta-Chunking framework, which specifically enhances chunking quality. We design two adaptive chunking techniques based on uncertainty, namely Perplexity Chunking and Margin Sampling Chunking. We establish the global information compensation mechanism, encompassing a two-stage hierarchical summary generation process and a three-stage text chunk rewriting procedure.
arXiv Detail & Related papers (2024-10-16T17:59:32Z)
- ASPS: Augmented Segment Anything Model for Polyp Segmentation [77.25557224490075]
The Segment Anything Model (SAM) has introduced unprecedented potential for polyp segmentation.
SAM's Transformer-based structure prioritizes global and low-frequency information.
CFA integrates a trainable CNN encoder branch with a frozen ViT encoder, enabling the integration of domain-specific knowledge.
arXiv Detail & Related papers (2024-06-30T14:55:32Z)
- CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation [73.89509052503222]
This paper presents a simple but performant semi-supervised semantic segmentation approach, called CorrMatch.
We observe that the correlation maps not only enable clustering pixels of the same category easily but also contain good shape information.
We propose to conduct pixel propagation by modeling the pairwise similarities of pixels to spread the high-confidence pixels and dig out more.
Then, we perform region propagation to enhance the pseudo labels with accurate class-agnostic masks extracted from the correlation maps.
arXiv Detail & Related papers (2023-06-07T10:02:29Z)
- FCN-Transformer Feature Fusion for Polyp Segmentation [12.62213319797323]
Colonoscopy is widely recognised as the gold standard procedure for the early detection of colorectal cancer.
The manual segmentation of polyps in colonoscopy images is time-consuming.
The use of deep learning for automation of polyp segmentation has become important.
arXiv Detail & Related papers (2022-08-17T15:31:06Z)
- CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation [118.18977078626776]
We propose an end-to-end self-supervised learning framework for event segmentation/boundary detection.
Our framework exploits a transformer-based feature reconstruction scheme to detect event boundary by reconstruction errors.
The goal of our work is to segment generic events rather than localize some specific ones.
arXiv Detail & Related papers (2021-09-30T14:40:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.