Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking
- URL: http://arxiv.org/abs/2507.09935v1
- Date: Mon, 14 Jul 2025 05:21:58 GMT
- Title: Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking
- Authors: Hai Toan Nguyen, Tien Dat Nguyen, Viet Ha Nguyen,
- Abstract summary: This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering.<n>During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations.<n> Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
- Score: 0.9968037829925942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and domain-specific. However, traditional methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
Related papers
- DISRetrieval: Harnessing Discourse Structure for Long Document Retrieval [51.89673002051528]
DISRetrieval is a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding.<n>Our studies confirm that discourse structure significantly enhances retrieval effectiveness across different document lengths and query types.
arXiv Detail & Related papers (2025-05-26T14:45:12Z) - ELITE: Embedding-Less retrieval with Iterative Text Exploration [5.8851517822935335]
Large Language Models (LLMs) have achieved impressive progress in natural language processing.<n>Their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks.
arXiv Detail & Related papers (2025-05-17T08:48:43Z) - SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs [70.79124435220695]
We propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE)<n>We first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation.<n>We then introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination.
arXiv Detail & Related papers (2025-04-17T17:59:27Z) - Bridging Textual-Collaborative Gap through Semantic Codes for Sequential Recommendation [91.13055384151897]
CCFRec is a novel Code-based textual and Collaborative semantic Fusion method for sequential Recommendation.<n>We generate fine-grained semantic codes from multi-view text embeddings through vector quantization techniques.<n>In order to further enhance the fusion of textual and collaborative semantics, we introduce an optimization strategy.
arXiv Detail & Related papers (2025-03-15T15:54:44Z) - TrustRAG: An Information Assistant with Retrieval Augmented Generation [73.84864898280719]
TrustRAG is a novel framework that enhances acRAG from three perspectives: indexing, retrieval, and generation.<n>We open-source the TrustRAG framework and provide a demonstration studio designed for excerpt-based question answering tasks.
arXiv Detail & Related papers (2025-02-19T13:45:27Z) - Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs [1.6575279044457722]
This paper proposes a Clustering, Labeling, then Augmenting framework that enhances performance in Semi-Supervised Text Classification tasks.<n>Unlike traditional SSTC approaches, this framework employs clustering to select representative "landmarks" for labeling.<n> Empirical results show that even in complex text document classification scenarios involving over 100 categories, our method achieves state-of-the-art accuracies of 95.41% on the Reuters dataset and 82.43% on the Web of Science dataset.
arXiv Detail & Related papers (2024-11-09T13:17:39Z) - Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from knowledge graph (KG)<n>We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR)
arXiv Detail & Related papers (2024-10-17T17:03:23Z) - Neural Sequence-to-Sequence Modeling with Attention by Leveraging Deep Learning Architectures for Enhanced Contextual Understanding in Abstractive Text Summarization [0.0]
This paper presents a novel framework for abstractive TS of single documents.
It integrates three dominant aspects: structure, semantic, and neural-based approaches.
Results indicate significant improvements in handling rare and OOV words.
arXiv Detail & Related papers (2024-04-08T18:33:59Z) - Text Clustering with Large Language Model Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms.<n>Recent advancements in large language models (LLMs) have the potential to enhance this task.<n>Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z) - Contextualization Distillation from Large Language Model for Knowledge
Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - CLIP-GCD: Simple Language Guided Generalized Category Discovery [21.778676607030253]
Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data.
Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods.
We propose to leverage multi-modal (vision and language) models, in two complementary ways.
arXiv Detail & Related papers (2023-05-17T17:55:33Z) - CEIL: A General Classification-Enhanced Iterative Learning Framework for
Text Clustering [16.08402937918212]
We propose a novel Classification-Enhanced Iterative Learning framework for short text clustering.
In each iteration, we first adopt a language model to retrieve the initial text representations.
After strict data filtering and aggregation processes, samples with clean category labels are retrieved, which serve as supervision information.
Finally, the updated language model with improved representation ability is used to enhance clustering in the next iteration.
arXiv Detail & Related papers (2023-04-20T14:04:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.