Enhancing Retrieval-Augmented Generation with Topic-Enriched Embeddings: A Hybrid Approach Integrating Traditional NLP Techniques
- URL: http://arxiv.org/abs/2601.00891v1
- Date: Wed, 31 Dec 2025 13:43:57 GMT
- Title: Enhancing Retrieval-Augmented Generation with Topic-Enriched Embeddings: A Hybrid Approach Integrating Traditional NLP Techniques
- Authors: Rodrigo Kataishi
- Abstract summary: This work proposes topic-enriched embeddings that integrate term-based signals and topic structure with contextual sentence embeddings. By jointly capturing term-level and topic-level semantics, topic-enriched embeddings improve semantic clustering, increase retrieval precision, and reduce computational burden.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented generation (RAG) systems rely on accurate document retrieval to ground large language models (LLMs) in external knowledge, yet retrieval quality often degrades in corpora where topics overlap and thematic variation is high. This work proposes topic-enriched embeddings that integrate term-based signals and topic structure with contextual sentence embeddings. The approach combines TF-IDF with topic modeling and dimensionality reduction, using Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to encode latent topical organization, and fuses these representations with a compact contextual encoder (all-MiniLM). By jointly capturing term-level and topic-level semantics, topic-enriched embeddings improve semantic clustering, increase retrieval precision, and reduce computational burden relative to purely contextual baselines. Experiments on a legal-text corpus show consistent gains in clustering coherence and retrieval metrics, suggesting that topic-enriched embeddings can serve as a practical component for more reliable knowledge-intensive RAG pipelines.
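The abstract names the components (TF-IDF, LSA, LDA, all-MiniLM) but not how they are combined. Below is a minimal sketch of how such topic-enriched embeddings could be assembled with scikit-learn and sentence-transformers; the component counts, the count-based input to LDA, the specific all-MiniLM-L6-v2 checkpoint, and fusion by L2-normalized concatenation are all assumptions, not the paper's specification.

```python
# Minimal sketch of topic-enriched embeddings: TF-IDF -> LSA, word
# counts -> LDA, fused with a compact contextual encoder. The fusion
# by L2-normalized concatenation and all dimensions are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sentence_transformers import SentenceTransformer

docs = [  # toy legal-style corpus; a real corpus would be far larger
    "The tenant shall provide thirty days written notice to vacate.",
    "Notice requirements for lease termination vary by jurisdiction.",
    "The court awarded compensatory damages for breach of contract.",
    "Remedies for breach include damages and specific performance.",
    "The appellate court reversed the judgment on procedural grounds.",
    "Procedural defects may justify reversal on appeal.",
]

def l2norm(M):
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)

# Term-level signal: TF-IDF, reduced with LSA (truncated SVD).
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)  # ~100 on real corpora
Z_lsa = lsa.fit_transform(X_tfidf)

# Topic-level signal: LDA over raw counts (its generative model expects
# counts rather than TF-IDF weights -- an implementation choice here).
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
Z_lda = lda.fit_transform(X_counts)

# Contextual signal: compact sentence encoder (checkpoint assumed).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
Z_ctx = encoder.encode(docs, normalize_embeddings=True)

# Fuse the three views; each block is unit-normalized so none dominates.
Z = l2norm(np.hstack([l2norm(Z_lsa), l2norm(Z_lda), Z_ctx]))

# Retrieval: push the query through the same pipeline, rank by cosine.
def embed(texts):
    zl = l2norm(lsa.transform(tfidf.transform(texts)))
    zd = l2norm(lda.transform(counts.transform(texts)))
    zc = encoder.encode(texts, normalize_embeddings=True)
    return l2norm(np.hstack([zl, zd, zc]))

q = embed(["What damages are available for breach of contract?"])[0]
print([docs[i] for i in np.argsort(-(Z @ q))[:2]])
```

Concatenation is only the simplest plausible fusion; weighting each view or learning a projection over the concatenated vector are natural variants, and the share of dimensions given to each view controls how much topical versus contextual signal drives retrieval.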
Related papers
- Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation [57.97548022208733]
We show that seemingly superficial choices in key-value extraction can induce shifts in accuracy and stability. We introduce Contextual Normalization, a strategy that adaptively standardizes context representations before generation.
arXiv Detail & Related papers (2025-10-15T06:28:25Z)
- Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering [59.54662810933882]
Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models, often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering.
arXiv Detail & Related papers (2025-09-23T15:12:58Z)
- Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck [0.0]
We develop a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound.
arXiv Detail & Related papers (2025-08-26T20:00:51Z)
- Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering [49.43814054718318]
Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity. We propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs.
arXiv Detail & Related papers (2025-08-15T06:36:13Z)
- Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis [0.43988112145759295]
This work introduces a principled evaluation framework for large language model (LLM) based text augmentation. Empirical evaluations show that GPT-3.5 Turbo achieved the best balance of semantic fidelity, diversity, and generation efficiency.
arXiv Detail & Related papers (2025-07-16T10:49:30Z)
- Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking [0.9968037829925942]
This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations; a generic sketch of this two-level retrieval pattern appears after this list. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
arXiv Detail & Related papers (2025-07-14T05:21:58Z)
- Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation [52.51005875755718]
We focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse. Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings. Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences.
arXiv Detail & Related papers (2025-05-21T05:29:55Z)
- HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning [6.2751089721877955]
RAG enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning.
arXiv Detail & Related papers (2024-12-05T23:10:56Z)
- Contextual Categorization Enhancement through LLMs Latent-Space [0.31263095816232184]
We propose leveraging transformer models to distill semantic information from texts in the Wikipedia dataset.
We then explore different approaches based on these encodings to assess and enhance the semantic identity of the categories.
arXiv Detail & Related papers (2024-04-25T09:20:51Z)
- Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
- TopicNet: Semantic Graph-Guided Topic Discovery [51.71374479354178]
Existing deep hierarchical topic models are able to extract semantically meaningful topics from a text corpus in an unsupervised manner.
We introduce TopicNet as a deep hierarchical topic model that can inject prior structural knowledge as an inductive bias to influence learning.
arXiv Detail & Related papers (2021-10-27T09:07:14Z)
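As referenced in the hierarchical-chunking entry above, two-level retrieval can be sketched generically: cluster the segment embeddings, match a query against cluster centroids first, then rank only the segments inside the winning clusters. The snippet below is a minimal illustration under assumed choices (KMeans clustering, an all-MiniLM-L6-v2 encoder, a toy corpus); it is not any listed paper's implementation.

```python
# Generic two-level (cluster-then-segment) retrieval sketch. This is
# not any listed paper's implementation; encoder, cluster count, and
# the toy corpus are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

segments = [
    "The tenant must give thirty days notice before vacating.",
    "Notice periods for lease termination vary by jurisdiction.",
    "The court awarded damages for breach of the service contract.",
    "Contract remedies include damages, rescission, and specific performance.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
E = encoder.encode(segments, normalize_embeddings=True)

# Cluster-level index: group segments, keep one unit-norm centroid each.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(E)
C = km.cluster_centers_ / np.linalg.norm(km.cluster_centers_, axis=1, keepdims=True)

def retrieve(query, top_clusters=1, top_segments=2):
    q = encoder.encode([query], normalize_embeddings=True)[0]
    # Stage 1: shortlist clusters by centroid cosine similarity.
    best = np.argsort(-(C @ q))[:top_clusters]
    # Stage 2: rank only the segments inside the shortlisted clusters.
    idx = np.flatnonzero(np.isin(km.labels_, best))
    return [segments[i] for i in idx[np.argsort(-(E[idx] @ q))][:top_segments]]

print(retrieve("What remedies exist for breach of contract?"))
```

Stage 1 prunes the search space so stage 2 scores only a fraction of the corpus, which is where the efficiency appeal of cluster-level retrieval typically comes from.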