Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data
- URL: http://arxiv.org/abs/2507.12425v1
- Date: Wed, 16 Jul 2025 17:13:06 GMT
- Title: Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data
- Authors: Chandana Cheerla
- Abstract summary: Large Language Models (LLMs) have strong generative capabilities. They are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data. This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Organizations increasingly rely on proprietary enterprise data, including HR records, structured reports, and tabular documents, for critical decision-making. While Large Language Models (LLMs) have strong generative capabilities, they are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data. This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking. The framework applies semantic chunking to maintain textual coherence and retains tabular data structures to preserve row-column integrity. Quantized indexing optimizes retrieval efficiency, while human-in-the-loop feedback and conversation memory improve adaptability. Experiments on enterprise datasets show notable improvements: Precision@5 increased by 15 percentage points (90 versus 75), Recall@5 by 13 points (87 versus 74), and Mean Reciprocal Rank by 0.16 (0.85 versus 0.69). Qualitative evaluations show higher scores in Faithfulness (4.6 versus 3.0), Completeness (4.2 versus 2.5), and Relevance (4.5 versus 3.2) on a 5-point Likert scale. These results demonstrate the framework's effectiveness in delivering accurate, comprehensive, and contextually relevant responses for enterprise tasks. Future work includes extending to multimodal data and integrating agent-based retrieval. The source code will be released at https://github.com/CheerlaChandana/Enterprise-Chatbot
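The hybrid retrieval idea in the abstract can be illustrated with a minimal sketch. The toy corpus, the simple whitespace tokenizer, and the stand-in dense ranking below are illustrative assumptions, not the paper's implementation (which uses all-mpnet-base-v2 embeddings and a cross-encoder reranker); reciprocal rank fusion (RRF) is likewise one common way to merge sparse and dense rankings, not necessarily the fusion the authors use.

```python
# Minimal sketch of hybrid retrieval: BM25 sparse scoring fused with a
# dense ranking via reciprocal rank fusion (RRF). Toy data throughout.
import math
from collections import Counter

docs = [
    "quarterly HR attrition report with headcount tables",
    "employee onboarding checklist and policy summary",
    "structured sales report by region and quarter",
]

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with plain BM25."""
    tokenized = [d.lower().split() for d in corpus]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(corpus)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(toks) / avgdl)
            )
        scores.append(s)
    return scores

def rrf(rankings, k=60):
    """Fuse several rankings (lists of doc indices) by reciprocal rank."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (k + rank + 1)
    return [doc_id for doc_id, _ in fused.most_common()]

query = "quarterly report"
scores = bm25_scores(query, docs)
sparse_rank = sorted(range(len(docs)), key=lambda i: -scores[i])
# Stand-in for a dense-embedding ranking; a real system would rank by
# cosine similarity of all-mpnet-base-v2 vectors, then rerank the fused
# top-k with a cross-encoder.
dense_rank = [2, 1, 0]
print(rrf([sparse_rank, dense_rank]))  # → [2, 0, 1]
```

In a full pipeline, the fused top-k candidates would then be rescored by a cross-encoder and filtered on NER-derived metadata, as the abstract describes.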
Related papers
- RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation [51.86515213749527]
We present RoboTwin 2.0, a scalable simulation framework that enables automated, large-scale generation of diverse and realistic data. To improve sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language instructions. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments, and pre-collect over 100,000 domain-randomized expert trajectories.
arXiv Detail & Related papers (2025-06-22T16:26:53Z) - Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection [65.96556073745197]
DiverSified File selection algorithm (DiSF) is proposed to select the most decorrelated text files in the feature space. DiSF saves 98.5% of 590M training files in SlimPajama, outperforming the full-data pre-training within a 50B training budget.
arXiv Detail & Related papers (2025-04-29T11:13:18Z) - MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework [15.410873298893817]
We propose Multi-Modal Knowledge-Based Retrieval-Augmented Generation (MMKB-RAG). This framework leverages the inherent knowledge boundaries of models to dynamically generate semantic tags for the retrieval process. Extensive experiments on knowledge-based visual question-answering tasks demonstrate the efficacy of our approach.
arXiv Detail & Related papers (2025-04-14T10:19:47Z) - Geometric Median Matching for Robust k-Subset Selection from Noisy Data [75.86423267723728]
We propose a novel k-subset selection strategy that leverages the Geometric Median -- a robust estimator with an optimal breakdown point of 1/2. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption.
arXiv Detail & Related papers (2025-04-01T09:22:05Z) - Enhancing Domain-Specific Retrieval-Augmented Generation: Synthetic Data Generation and Evaluation using Reasoning Models [0.6827423171182154]
Retrieval-Augmented Generation (RAG) systems face significant performance gaps when applied to technical domains. We propose a framework combining granular evaluation metrics with synthetic data generation to optimize domain-specific RAG performance. Our empirical analysis reveals critical insights: smaller chunks (fewer than 10 tokens) improve precision by 31-42%.
arXiv Detail & Related papers (2025-02-21T06:38:57Z) - SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval [0.7421845364041001]
This paper introduces SKETCH, a novel methodology that enhances the RAG retrieval process by integrating semantic text retrieval with knowledge graphs. SKETCH consistently outperforms baseline approaches on key RAGAS metrics such as answer_relevancy, faithfulness, context_precision and context_recall. Results highlight SKETCH's capability in delivering more accurate and contextually relevant responses, setting new benchmarks for future retrieval systems.
arXiv Detail & Related papers (2024-12-19T22:51:56Z) - Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation [1.64043572114825]
We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment. Our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric. By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality.
arXiv Detail & Related papers (2024-12-06T16:22:32Z) - Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification [3.9889306957591755]
We propose a novel framework to boost deep learning models' performance given augmented data/samples in text classification tasks.
We propose novel weight-dependent enqueue and dequeue algorithms to utilize augmented samples' weight/quality information effectively.
Our framework achieves an average of 1.6%, up to 4.3% absolute improvement on Text-CNN encoders and an average of 1.4%, up to 4.4% absolute improvement on RoBERTa-base encoders.
arXiv Detail & Related papers (2024-09-26T02:19:13Z) - Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation [96.78845113346809]
Retrieval-augmented language models (RALMs) have shown strong performance and wide applicability in knowledge-intensive tasks.
This paper proposes SynCheck, a lightweight monitor that leverages fine-grained decoding dynamics to detect unfaithful sentences.
We also introduce FOD, a faithfulness-oriented decoding algorithm guided by beam search for long-form retrieval-augmented generation.
arXiv Detail & Related papers (2024-06-19T16:42:57Z) - Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within data through their natural language counterparts, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z) - Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing [52.24507547010127]
Cross-domain context-dependent semantic parsing is a new focus of research.
We present a dynamic graph framework that effectively models contextual utterances, tokens, database schemas, and their complicated interactions as the conversation proceeds.
The proposed framework outperforms all existing models by large margins, achieving new state-of-the-art performance on two large-scale benchmarks.
arXiv Detail & Related papers (2021-01-05T18:11:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.