Enhancing Domain-Specific Retrieval-Augmented Generation: Synthetic Data Generation and Evaluation using Reasoning Models
- URL: http://arxiv.org/abs/2502.15854v1
- Date: Fri, 21 Feb 2025 06:38:57 GMT
- Title: Enhancing Domain-Specific Retrieval-Augmented Generation: Synthetic Data Generation and Evaluation using Reasoning Models
- Authors: Aryan Jadon, Avinash Patil, Shashank Kumar
- Abstract summary: Retrieval-Augmented Generation (RAG) systems face significant performance gaps when applied to technical domains.
We propose a framework combining granular evaluation metrics with synthetic data generation to optimize domain-specific RAG performance.
Our empirical analysis reveals critical insights: smaller chunks (less than 10 tokens) improve precision by 31-42%.
- Score: 0.6827423171182154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) systems face significant performance gaps when applied to technical domains requiring precise information extraction from complex documents. Current evaluation methodologies relying on document-level metrics inadequately capture the token-resolution retrieval accuracy that is critical for domain-related documents. We propose a framework combining granular evaluation metrics with synthetic data generation to optimize domain-specific RAG performance. First, we introduce the token-aware metrics Precision $\Omega$ and Intersection-over-Union (IoU), which quantify the context-preservation versus information-density trade-offs inherent in technical texts. Second, we develop a reasoning-model-driven pipeline using instruction-tuned LLMs (DeepSeek-R1, DeepSeek-R1 distilled variants, and Phi-4) to generate context-anchored QA pairs with discontinuous reference spans across three specialized corpora: SEC 10-K filings (finance), biomedical abstracts (PubMed), and APT threat reports (cybersecurity). Our empirical analysis reveals critical insights: smaller chunks (less than 10 tokens) improve precision by 31-42% (IoU = 0.071 vs. baseline 0.053) at a recall cost (-18%), while domain-specific embedding strategies yield 22% variance in optimal chunk sizing (5-20 tokens). The DeepSeek-R1-Distill-Qwen-32B model demonstrates superior concept alignment (+14% mean IoU over alternatives), though no configuration universally dominates. Financial texts favor larger chunks for risk-factor coverage (Recall = 0.81 at size = 20), whereas cybersecurity content benefits from atomic segmentation (Precision $\Omega$ = 0.28 at size = 5). Our code is available at https://github.com/aryan-jadon/Synthetic-Data-Generation-and-Evaluation-using-Reasoning-Model
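The token-level evaluation the abstract describes can be made concrete with a small sketch. The code below computes standard precision, recall, and IoU over token-index sets with possibly discontinuous spans; the paper's Precision $\Omega$ has its own definition in the full text, so this is an illustrative approximation, not the authors' implementation.

```python
# Illustrative token-level retrieval metrics over (possibly discontinuous)
# token spans. A sketch of the general idea only, not the paper's exact
# Precision-Omega metric.

def token_metrics(retrieved: set[int], reference: set[int]) -> dict:
    """Compare retrieved vs. gold token-index sets."""
    inter = retrieved & reference
    union = retrieved | reference
    return {
        "precision": len(inter) / len(retrieved) if retrieved else 0.0,
        "recall":    len(inter) / len(reference) if reference else 0.0,
        "iou":       len(inter) / len(union) if union else 0.0,
    }

# Example: a 5-token chunk overlapping a discontinuous gold span.
retrieved = set(range(100, 105))            # tokens 100-104
reference = {101, 102, 103, 120, 121}       # gold answer spans
print(token_metrics(retrieved, reference))  # precision 0.6, recall 0.6, iou ~0.43
```

Sweeping chunk sizes (e.g. 5, 10, 20 tokens) and averaging IoU over a QA set reproduces the kind of precision/recall trade-off the paper reports.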
Related papers
- Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data [0.0]
Large Language Models (LLMs) have strong generative capabilities.
They are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats.
Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data.
This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking.
arXiv Detail & Related papers (2025-07-16T17:13:06Z)
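A minimal sketch of the hybrid dense-plus-sparse retrieval idea in the entry above, assuming the rank_bm25 and sentence-transformers packages; the fusion weight, lack of score normalization, and the omitted NER filtering and cross-encoder reranking stages are illustrative simplifications, not the paper's pipeline.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = ["Revenue grew 12% year over year.",
        "The APT group used spearphishing for initial access."]
query = "How much did revenue grow?"

# Sparse scores: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = bm25.get_scores(query.lower().split())

# Dense scores: cosine similarity of sentence embeddings.
model = SentenceTransformer("all-mpnet-base-v2")
dense = util.cos_sim(model.encode(query), model.encode(docs))[0]

# Naive linear fusion (real systems normalize the two score scales first).
alpha = 0.5
fused = [alpha * float(dense[i]) + (1 - alpha) * float(sparse[i])
         for i in range(len(docs))]
print(docs[max(range(len(docs)), key=fused.__getitem__)])
```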
- Multi-Granular Discretization for Interpretable Generalization in Precise Cyberattack Identification [0.0]
The Interpretable Generalization (IG) mechanism is used to learn coherent patterns.
IG-MD represents every continuous feature at several Gaussian-based resolutions.
On UKM-IDS20, IG-MD lifts precision by ≥ 4 percentage points across all nine train-test splits.
arXiv Detail & Related papers (2025-07-16T12:57:38Z)
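One way to read "several Gaussian-based resolutions" in the entry above is binning each feature by quantiles of a fitted normal distribution at multiple bin counts; the sketch below follows that reading, and the bin counts and coding are assumptions for illustration, not IG-MD's actual construction.

```python
# Hedged sketch: multi-granular, Gaussian-based discretization of one feature.
import numpy as np
from scipy.stats import norm

def gaussian_bins(x: np.ndarray, n_bins: int) -> np.ndarray:
    """Assign each value to a bin whose edges are Gaussian quantiles of x."""
    mu, sigma = x.mean(), x.std()
    edges = norm.ppf(np.linspace(0, 1, n_bins + 1)[1:-1], mu, sigma)
    return np.digitize(x, edges)

x = np.random.default_rng(0).normal(50, 10, size=1000)
# The same feature represented at three resolutions simultaneously.
multi_granular = {k: gaussian_bins(x, k) for k in (3, 5, 9)}
```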
- Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains [13.58151841630302]
We propose a novel method METEORA that replaces re-ranking in RAG with a rationale-driven selection approach.
We show METEORA improves generation accuracy by 33.34% while using approximately 50% fewer chunks than state-of-the-art re-ranking methods.
In adversarial settings, METEORA significantly improves the F1 score from 0.10 to 0.44.
arXiv Detail & Related papers (2025-05-21T20:57:16Z)
- PCA-RAG: Principal Component Analysis for Efficient Retrieval-Augmented Generation [0.0]
High-dimensional language model embeddings can present scalability challenges in terms of storage and latency.
This paper investigates the use of Principal Component Analysis (PCA) to reduce embedding dimensionality.
We show that PCA-based compression offers a viable balance between retrieval fidelity and resource efficiency.
arXiv Detail & Related papers (2025-04-11T09:38:12Z)
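A minimal sketch of PCA-based embedding compression in the spirit of this entry; sklearn's PCA and the 768-to-128 reduction are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(10_000, 768))   # stand-in for real embeddings

pca = PCA(n_components=128).fit(corpus_emb)   # fit on the corpus side
corpus_small = pca.transform(corpus_emb)      # 6x smaller index
query_small = pca.transform(rng.normal(size=(1, 768)))

# Retrieval then proceeds as usual (cosine / inner-product search) in the
# reduced space; explained_variance_ratio_ shows how much signal survives.
print(pca.explained_variance_ratio_.sum())
```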
- START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long chain-of-thought (CoT) reasoning LLM.
START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging.
It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z)
- Claim Extraction for Fact-Checking: Data, Models, and Automated Metrics [0.0]
We release the FEVERFact dataset, with 17K atomic factual claims extracted from 4K contextualised Wikipedia sentences.
For each metric, we implement a scale using a reduction to an already-explored NLP task.
We validate our metrics against human grading of generic claims and find that the model ranking on $F_{fact}$, our hardest metric, did not change.
arXiv Detail & Related papers (2025-02-07T14:20:45Z)
- The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance rejecting harmful requests for safety against accommodating legitimate ones for utility.
This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.
We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
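The entry above builds on Direct Preference Optimization; the sketch below is the textbook DPO loss, not the paper's specific alignment framework, with scalar sequence log-probabilities as stand-ins for real model outputs.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO: -log sigmoid(beta * (log-ratio(chosen) - log-ratio(rejected)))."""
    logits = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * logits).mean()

# Sequence log-probabilities for a batch of two preference pairs.
pi_chosen    = torch.tensor([-12.0, -9.5])   # policy log p(chosen)
pi_rejected  = torch.tensor([-14.0, -10.0])  # policy log p(rejected)
ref_chosen   = torch.tensor([-13.0, -9.8])   # reference log p(chosen)
ref_rejected = torch.tensor([-13.5, -9.9])   # reference log p(rejected)
print(dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected))
```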
- Certifiably Robust Model Evaluation in Federated Learning under Meta-Distributional Shifts [8.700087812420687]
We provide guarantees for the model's performance on a different, unseen network "B".
We show how the principled vanilla DKW bound enables certification of the model's true performance on unseen clients within the same (source) network.
arXiv Detail & Related papers (2024-10-26T18:45:15Z)
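The "vanilla DKW bound" mentioned above is the standard Dvoretzky-Kiefer-Wolfowitz concentration result: with n i.i.d. samples, the empirical CDF lies uniformly within epsilon of the true CDF with probability at least 1 - alpha. Applying it to per-client performance estimates is the entry's idea; the sketch shows only the bound itself.

```python
import math

def dkw_epsilon(n: int, alpha: float = 0.05) -> float:
    """Uniform CDF deviation certified by DKW at confidence 1 - alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

# e.g. with 200 observed clients, the empirical performance distribution
# is within ~0.096 of the truth with 95% confidence.
print(dkw_epsilon(200))
```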
- Improved Out-of-Scope Intent Classification with Dual Encoding and Threshold-based Re-Classification [6.975902383951604]
Current methodologies face difficulties with the unpredictable distribution of outliers.
We present the Dual Encoder for Threshold-Based Re-Classification (DETER) framework to address these challenges.
Our model outperforms previous benchmarks, increasing the F1 score by up to 13% for known intents and 5% for unknown intents.
arXiv Detail & Related papers (2024-05-30T11:46:42Z)
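A hedged sketch of the threshold-based re-classification idea in the entry above: accept the top in-scope intent only when its confidence clears a threshold, otherwise label the utterance out-of-scope. The softmax scoring and threshold value below are placeholders, not DETER's actual dual-encoder components.

```python
import numpy as np

def classify(intent_scores: np.ndarray, labels: list[str], tau: float = 0.7) -> str:
    """Re-classify low-confidence predictions as out-of-scope."""
    probs = np.exp(intent_scores) / np.exp(intent_scores).sum()  # softmax
    best = int(probs.argmax())
    return labels[best] if probs[best] >= tau else "out_of_scope"

labels = ["book_flight", "check_balance", "play_music"]
print(classify(np.array([4.0, 0.5, 0.2]), labels))  # confident -> book_flight
print(classify(np.array([1.0, 0.9, 0.8]), labels))  # flat scores -> out_of_scope
```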
- Linear-time Minimum Bayes Risk Decoding with Reference Aggregation [52.1701152610258]
Minimum Bayes Risk (MBR) decoding is a text generation technique that has been shown to improve the quality of machine translations.
It requires the pairwise calculation of a utility metric, which has quadratic complexity.
We propose to approximate pairwise metric scores with scores calculated against aggregated reference representations.
arXiv Detail & Related papers (2024-02-06T18:59:30Z)
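The linear-time idea above can be sketched as scoring each candidate once against an aggregate of the pseudo-references instead of against every one of them; the bag-of-words cosine utility below is a toy stand-in for the chrF and embedding-based utilities the paper actually studies.

```python
from collections import Counter
import math

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Sampled candidates double as pseudo-references, as in sampling-based MBR.
candidates = ["the cat sat on the mat",
              "a cat is on a mat",
              "dogs bark loudly"]

# Aggregate all pseudo-references into a single count vector (one pass) ...
aggregate = Counter()
for ref in candidates:
    aggregate.update(bow(ref))

# ... then score each candidate once against it: linear, not quadratic.
best = max(candidates, key=lambda c: cosine(bow(c), aggregate))
print(best)
```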
- Beyond Accuracy: Automated De-Identification of Large Real-World Clinical Text Datasets [7.6631083158336715]
This paper summarizes lessons learned from building a system used to de-identify over one billion real clinical notes.
A fully automated solution must achieve a very high level of accuracy so that manual review is not required.
arXiv Detail & Related papers (2023-12-13T20:15:29Z)
- Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
arXiv Detail & Related papers (2023-10-30T03:45:15Z)
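A small sketch of the general shape of fine-grained IoU evaluation in the entry above: compute per-image IoUs for a class and report the mean alongside a worst-case statistic, so a few large objects cannot mask failures elsewhere. The exact metric definitions are in the paper; this is only illustrative.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of two binary masks; empty-vs-empty counts as a perfect match."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

# Binary masks for one class across three images.
preds = [np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]]), np.array([[0, 0], [0, 1]])]
gts   = [np.array([[1, 1], [0, 0]]), np.array([[1, 1], [0, 0]]), np.array([[0, 0], [1, 1]])]

per_image = [iou(p, g) for p, g in zip(preds, gts)]
print("mean IoU:", np.mean(per_image), "worst-case IoU:", min(per_image))
```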
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
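Multi-reference scoring as advocated above is directly supported by standard tooling; the sketch uses sacrebleu's multi-reference API, where each inner list is one reference stream aligned with the hypotheses. The toy sentences are illustrative.

```python
import sacrebleu

hypotheses = ["a cat sat on the mat"]
refs_a = ["the cat sat on the mat"]        # reference stream 1
refs_b = ["a cat was sitting on the mat"]  # reference stream 2

single = sacrebleu.corpus_bleu(hypotheses, [refs_a])
multi = sacrebleu.corpus_bleu(hypotheses, [refs_a, refs_b])
# Clipped n-gram counts take the max over references, so added
# references typically raise the match rate.
print(single.score, multi.score)
```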
- Bridging the Domain Gaps in Context Representations for k-Nearest Neighbor Neural Machine Translation [57.49095610777317]
$k$-Nearest neighbor machine translation ($k$NN-MT) has attracted increasing attention due to its ability to non-parametrically adapt to new translation domains.
We propose a novel approach to boost the datastore retrieval of $k$NN-MT by reconstructing the original datastore.
Our method can effectively boost the datastore retrieval and translation quality of $k$NN-MT.
arXiv Detail & Related papers (2023-05-26T03:04:42Z)
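A hedged sketch of the standard $k$NN-MT mixture this entry builds on: retrieve the nearest stored decoder states, turn their distances into a token distribution, and interpolate with the base model. The toy datastore, temperature, and mixing weight are illustrative; the paper's datastore reconstruction is not shown.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=2, temp=10.0):
    """Turn the k nearest datastore entries into a target-token distribution."""
    dist = np.linalg.norm(keys - query, axis=1)   # L2 distance to stored states
    idx = np.argsort(dist)[:k]                    # k nearest neighbours
    w = np.exp(-dist[idx] / temp)
    w /= w.sum()                                  # softmax over negative distance
    p = np.zeros(vocab_size)
    for weight, token in zip(w, values[idx]):
        p[token] += weight
    return p

keys = np.array([[0.0, 1.0], [0.1, 0.9], [5.0, 5.0]])  # stored decoder states
values = np.array([7, 7, 3])                           # target tokens they emitted
p_knn = knn_distribution(np.array([0.05, 0.95]), keys, values, vocab_size=10)

p_mt = np.full(10, 0.1)                    # base model's (toy, uniform) distribution
lam = 0.5
p_final = lam * p_knn + (1 - lam) * p_mt   # standard kNN-MT interpolation
print(int(p_final.argmax()))               # -> 7
```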
- Out-of-Vocabulary Entities in Link Prediction [1.9036571490366496]
Link prediction is often used as a proxy to evaluate the quality of embeddings.
As benchmarks are crucial for the fair comparison of algorithms, ensuring their quality is tantamount to providing a solid ground for developing better solutions.
We provide an implementation of an approach for spotting and removing such entities and provide corrected versions of the datasets WN18RR, FB15K-237, and YAGO3-10.
arXiv Detail & Related papers (2021-05-26T12:58:18Z)
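The cleaning step described above amounts to dropping evaluation triples whose head or tail entity never occurs in the training split, since no embedding was ever learned for such out-of-vocabulary entities. A minimal sketch with toy (head, relation, tail) triples follows; the data is invented for illustration.

```python
# Toy knowledge-graph splits in the usual (head, relation, tail) format.
train = [("paris", "capital_of", "france"), ("berlin", "capital_of", "germany")]
test = [("paris", "located_in", "france"), ("oslo", "capital_of", "norway")]

# Entities with learned embeddings are exactly those seen during training.
seen = {e for h, _, t in train for e in (h, t)}
clean_test = [(h, r, t) for h, r, t in test if h in seen and t in seen]
print(clean_test)  # the 'oslo'/'norway' triple is removed as out-of-vocabulary
```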
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
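A compact sketch of consensus ADMM, the family of methods this entry studies, on a toy decentralized least-squares problem: each agent solves a local regularized fit, then the agents average toward a shared consensus variable. Mini-batching, coding for stragglers, and communication details are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])

# Three agents, each holding a private local dataset (A_i, b_i).
agents = []
for _ in range(3):
    A = rng.normal(size=(50, 2))
    b = A @ w_true + 0.1 * rng.normal(size=50)
    agents.append((A, b))

rho = 1.0
x = [np.zeros(2) for _ in agents]   # local parameter vectors
u = [np.zeros(2) for _ in agents]   # scaled dual variables
z = np.zeros(2)                     # shared consensus variable

for _ in range(50):
    # Local solves: argmin_x 0.5*||A x - b||^2 + (rho/2)*||x - z + u||^2
    for i, (A, b) in enumerate(agents):
        x[i] = np.linalg.solve(A.T @ A + rho * np.eye(2),
                               A.T @ b + rho * (z - u[i]))
    z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)  # consensus averaging
    u = [ui + xi - z for xi, ui in zip(x, u)]               # dual update
print(z)  # approaches w_true = [2, -1]
```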