Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models
- URL: http://arxiv.org/abs/2410.13192v2
- Date: Sat, 14 Dec 2024 12:12:01 GMT
- Title: Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models
- Authors: Jiatao Li, Xinyu Hu, Xunjian Yin, Xiaojun Wan
- Abstract summary: We investigate the overall effectiveness of Self-Docs, identifying key factors that shape their contribution to RAG performance. Building on these insights, we develop a taxonomy grounded in Systemic Functional Linguistics to compare the influence of various Self-Docs categories. Our findings reveal which types of Self-Docs are most beneficial and offer practical guidelines for leveraging them.
- Score: 39.243030042003646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of documents generated by LLMs themselves (Self-Docs) alongside retrieved documents has emerged as a promising strategy for retrieval-augmented generation systems. However, previous research primarily focuses on optimizing the use of Self-Docs, with their inherent properties remaining underexplored. To bridge this gap, we first investigate the overall effectiveness of Self-Docs, identifying key factors that shape their contribution to RAG performance (RQ1). Building on these insights, we develop a taxonomy grounded in Systemic Functional Linguistics to compare the influence of various Self-Docs categories (RQ2) and explore strategies for combining them with external sources (RQ3). Our findings reveal which types of Self-Docs are most beneficial and offer practical guidelines for leveraging them to achieve significant improvements in knowledge-intensive question answering tasks.
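As a concrete illustration of the setup the abstract describes, the sketch below combines a self-generated document with retrieved documents in a single answering prompt. The `generate` and `retrieve` helpers are hypothetical stubs standing in for a real LLM client and retriever; the paper's actual prompt formats may differ.

```python
# Minimal sketch of a Self-Docs + retrieval pipeline (hypothetical helpers).

def generate(prompt: str) -> str:
    """Stub for an LLM call; replace with a real client."""
    return f"<LLM output for: {prompt[:40]}...>"

def retrieve(question: str, k: int = 3) -> list[str]:
    """Stub for a retriever; replace with BM25 or dense retrieval."""
    return [f"<retrieved doc {i} for: {question}>" for i in range(k)]

def answer_with_self_docs(question: str) -> str:
    # 1. Elicit a Self-Doc: a passage written from parametric knowledge
    #    before any retrieved evidence is seen.
    self_doc = generate(f"Write a short background passage about: {question}")
    # 2. Retrieve external documents as usual.
    retrieved = retrieve(question)
    # 3. Answer from the combined context.
    context = "\n\n".join([self_doc, *retrieved])
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(answer_with_self_docs("Who introduced Systemic Functional Linguistics?"))
```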
Related papers
- A Survey on Knowledge-Oriented Retrieval-Augmented Generation [45.65542434522205]
Retrieval-Augmented Generation (RAG) has gained significant attention in recent years.
RAG combines large-scale retrieval systems with generative models.
We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge.
arXiv Detail & Related papers (2025-03-11T01:59:35Z)
- Is Relevance Propagated from Retriever to Generator in RAG? [21.82171240511567]
RAG is a framework for incorporating external knowledge, usually in the form of a set of documents retrieved from a collection.
We empirically investigate whether a RAG context comprised of topically relevant documents leads to improved downstream performance.
arXiv Detail & Related papers (2025-02-20T20:21:46Z)
- Optimizing Knowledge Integration in Retrieval-Augmented Generation with Self-Selection [72.92366526004464]
Retrieval-Augmented Generation (RAG) has proven effective in enabling Large Language Models (LLMs) to produce more accurate and reliable responses.
We propose a novel Self-Selection RAG framework, in which the LLM selects between pairwise responses generated with its internal parametric knowledge alone and with external retrieved knowledge.
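A minimal sketch of the pairwise selection step might look as follows, assuming a stub `generate` function in place of a real LLM call; the verdict-parsing scheme is an illustrative simplification, not the paper's protocol.

```python
def generate(prompt: str) -> str:
    """Stub for an LLM call; replace with a real client."""
    return f"<LLM output for: {prompt[:40]}...>"

def self_select(question: str, retrieved_context: str) -> str:
    # Candidate A: answer from internal parametric knowledge only.
    internal = generate(f"Answer from memory, no external context: {question}")
    # Candidate B: answer conditioned on retrieved evidence.
    augmented = generate(f"Context: {retrieved_context}\nQuestion: {question}")
    # The model itself judges which candidate it trusts more.
    verdict = generate(
        "Two candidate answers follow. Reply with A or B only.\n"
        f"A: {internal}\nB: {augmented}"
    )
    return internal if verdict.strip().upper().startswith("A") else augmented
```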
arXiv Detail & Related papers (2025-02-10T04:29:36Z)
- GenTREC: The First Test Collection Generated by Large Language Models for Evaluating Information Retrieval Systems [0.33748750222488655]
GenTREC is the first test collection constructed entirely from documents generated by a Large Language Model (LLM).
We consider a document relevant only to the prompt that generated it, while other document-topic pairs are treated as non-relevant.
The resulting GenTREC collection comprises 96,196 documents, 300 topics, and 18,964 relevance "judgments".
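The stated relevance assumption maps directly onto a qrels construction: a document is relevant (1) to the topic whose prompt generated it and non-relevant (0) to every other topic. A small sketch with made-up identifiers:

```python
# Sketch of GenTREC's relevance assumption; the topic/document IDs below
# are illustrative, not from the actual collection.

generated = {
    "topic-001": ["doc-a", "doc-b"],   # docs generated from topic-001's prompt
    "topic-002": ["doc-c"],
}

def build_qrels(generated: dict[str, list[str]]) -> dict[tuple[str, str], int]:
    all_docs = {d for docs in generated.values() for d in docs}
    qrels = {}
    for topic, docs in generated.items():
        for doc in all_docs:
            qrels[(topic, doc)] = 1 if doc in docs else 0
    return qrels

print(build_qrels(generated))
```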
arXiv Detail & Related papers (2025-01-05T00:27:36Z)
- Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that the pointwise mutual information between a context and a question is an effective gauge for language model performance.
We propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance.
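In symbols, the gauge is PMI(q; d) = log p(q | d) − log p(q), scored with the language model itself. A sketch with a placeholder `logprob` stub (a real implementation would sum token log-probabilities from an LM):

```python
def logprob(text: str, condition: str = "") -> float:
    """Placeholder: a real implementation sums token log-probs of `text`
    (optionally conditioned on `condition`) under a language model."""
    return -0.1 * len(text) + (0.5 if condition else 0.0)

def pmi(question: str, document: str) -> float:
    # PMI(q; d) = log p(q | d) - log p(q)
    return logprob(question, condition=document) - logprob(question)

def best_context(question: str, candidates: list[str]) -> str:
    # Keep the document whose presence most raises the question's likelihood.
    return max(candidates, key=lambda d: pmi(question, d))
```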
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs [64.9693406713216]
Internal mechanisms that contribute to the effectiveness of RAG systems remain underexplored.
Our experiments reveal that several core groups of experts are primarily responsible for RAG-related behaviors.
We propose several strategies to enhance RAG's efficiency and effectiveness through expert activation.
arXiv Detail & Related papers (2024-10-20T16:08:54Z)
- Reward-RAG: Enhancing RAG with Reward Driven Supervision [43.66966457772646]
We introduce Reward-RAG, a novel approach designed to enhance the Retrieval-Augmented Generation (RAG) model through Reward-Driven Supervision.
Unlike previous RAG methodologies, our method adapts retrieval information to specific domains by employing CriticGPT to train a dedicated reward model.
This reward model generates synthesized datasets for fine-tuning the RAG, aligning its outputs more closely with human preferences.
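One way to picture the supervision loop, with an illustrative stub in place of the trained reward model, is to score (query, document) pairs and treat the scores as ranking labels for retriever fine-tuning:

```python
def reward_model(query: str, document: str) -> float:
    """Stub reward model; a trained model (e.g., distilled from critiques)
    would go here."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))

def make_ranking_labels(query: str, docs: list[str]) -> list[tuple[str, float]]:
    # Each candidate document gets a reward score; a retriever can then be
    # fine-tuned to rank high-reward documents first.
    return sorted(((d, reward_model(query, d)) for d in docs),
                  key=lambda pair: pair[1], reverse=True)
```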
arXiv Detail & Related papers (2024-10-03T15:26:50Z)
- Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs).
We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
arXiv Detail & Related papers (2024-09-16T09:06:44Z)
- SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation [50.26966969163348]
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG).
Existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries.
We propose Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm.
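A compact sketch of the idea, with UCB1 selection over candidate query reformulations and a stubbed self-reward in place of SeRTS's LLM feedback (illustrative only, not the authors' code):

```python
import math
import random

class Node:
    def __init__(self, query: str):
        self.query = query
        self.children: list["Node"] = []
        self.visits = 0
        self.value = 0.0

def self_reward(query: str) -> float:
    """Stub: an LLM would score its own retrieval/answer quality here."""
    return random.random()

def ucb1(parent: Node, child: Node, c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def search(root: Node, expand, iters: int = 50) -> Node:
    for _ in range(iters):
        path = [root]
        while path[-1].children:                               # selection
            parent = path[-1]
            path.append(max(parent.children, key=lambda ch: ucb1(parent, ch)))
        leaf = path[-1]
        leaf.children = [Node(q) for q in expand(leaf.query)]  # expansion
        if leaf.children:
            path.append(random.choice(leaf.children))
        reward = self_reward(path[-1].query)                   # self-reward rollout
        for node in path:                                      # backpropagation
            node.visits += 1
            node.value += reward
    return max(root.children, key=lambda ch: ch.visits)

best = search(Node("ACE inhibitors side effects"),
              expand=lambda q: [q + " cough", q + " hyperkalemia"])
print(best.query)
```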
arXiv Detail & Related papers (2024-06-17T06:48:31Z)
- ARAGOG: Advanced RAG Output Grading [44.99833362998488]
Retrieval-Augmented Generation (RAG) is essential for integrating external knowledge into Large Language Model (LLM) outputs.
This study assesses various RAG methods' impacts on retrieval precision and answer similarity.
arXiv Detail & Related papers (2024-04-01T10:43:52Z)
- REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering [115.72130322143275]
REAR is a RElevance-Aware Retrieval-augmented approach for open-domain question answering (QA).
We develop a novel architecture for LLM-based RAG systems by incorporating a specially designed assessment module.
Experiments on four open-domain QA tasks show that REAR significantly outperforms a number of previous competitive RAG approaches.
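The relevance-aware pattern can be pictured as gating generation on per-document scores; the lexical-overlap scorer below is a stand-in for REAR's learned assessment module, and `generate` is a stub LLM call:

```python
def relevance(question: str, document: str) -> float:
    q, d = set(question.lower().split()), set(document.lower().split())
    return len(q & d) / max(len(q), 1)

def generate(prompt: str) -> str:
    return f"<LLM output for: {prompt[:40]}...>"  # stub LLM call

def answer(question: str, documents: list[str], tau: float = 0.2) -> str:
    # Keep only documents whose relevance score clears the threshold.
    kept = [d for d in documents if relevance(question, d) >= tau]
    context = "\n".join(kept) if kept else "(no document judged relevant)"
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```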
arXiv Detail & Related papers (2024-02-27T13:22:51Z)
- CorpusLM: Towards a Unified Language Model on Corpus for Knowledge-Intensive Tasks [20.390672895839757]
Retrieval-augmented generation (RAG) has emerged as a popular solution to enhance factual accuracy.
Traditional retrieval modules often rely on a large document index and are disconnected from generative tasks.
We propose CorpusLM, a unified language model that integrates generative retrieval, closed-book generation, and RAG.
arXiv Detail & Related papers (2024-02-02T06:44:22Z)
- Leveraging BERT Language Models for Multi-Lingual ESG Issue Identification [0.30254881201174333]
Investors have increasingly recognized the significance of ESG criteria in their investment choices.
The Multi-Lingual ESG Issue Identification (ML-ESG) task encompasses the classification of news documents into 35 distinct ESG issue labels.
In this study, we explored multiple strategies harnessing BERT language models to achieve accurate classification of news documents across these labels.
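A baseline version of such a classifier, treating the task as single-label 35-way classification over a multilingual BERT checkpoint (both assumptions; the authors' exact setup may differ):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-multilingual-cased"  # placeholder checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=35)

batch = tokenizer(["Company X announces new emissions-reduction targets."],
                  return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**batch).logits      # shape: (1, 35), one score per label
predicted_label = logits.argmax(dim=-1).item()
```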
arXiv Detail & Related papers (2023-09-05T12:48:21Z)
- A Survey on Large Language Models for Recommendation [77.91673633328148]
Large Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP).
This survey presents a taxonomy that categorizes these models into two major paradigms: Discriminative LLM for Recommendation (DLLM4Rec) and Generative LLM for Recommendation (GLLM4Rec).
arXiv Detail & Related papers (2023-05-31T13:51:26Z)
- Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification [59.698811329287174]
We leverage GPT-2 for generating artificial training instances in order to improve classification performance.
Our results show that fine-tuning GPT-2 on a handful of labeled instances leads to consistent classification improvements.
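The augmentation recipe can be sketched with the Hugging Face `transformers` pipeline; the label-prefixed prompt format below is an illustrative choice, not necessarily the paper's exact scheme:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def augment(label: str, seed_text: str, n: int = 3) -> list[str]:
    # Prepend the class label so sampled continuations stay on-label.
    prompt = f"[{label}] {seed_text}"
    outputs = generator(prompt, max_new_tokens=40, num_return_sequences=n,
                        do_sample=True, pad_token_id=50256)
    return [o["generated_text"][len(prompt):].strip() for o in outputs]

print(augment("positive", "The movie was"))
```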
arXiv Detail & Related papers (2021-11-17T12:10:03Z)
- AdaSGD: Bridging the gap between SGD and Adam [14.886598905466604]
We identify potential differences in performance between SGD and Adam.
We demonstrate how AdaSGD combines the benefits of both SGD and Adam.
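One hedged reading of that bridge, keeping SGD's raw gradient direction while adapting a single global step size from an Adam-style second-moment estimate (an interpretation of the abstract, not the authors' implementation):

```python
import torch

class AdaSGD:
    """Sketch: SGD direction with one global, Adam-style adaptive step size."""

    def __init__(self, params, lr=0.1, beta=0.999, eps=1e-8):
        self.params = list(params)
        self.lr, self.beta, self.eps = lr, beta, eps
        self.v, self.t = 0.0, 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        grad_sq = sum(float((p.grad ** 2).sum())
                      for p in self.params if p.grad is not None)
        self.v = self.beta * self.v + (1 - self.beta) * grad_sq
        v_hat = self.v / (1 - self.beta ** self.t)    # bias correction
        scale = self.lr / (v_hat ** 0.5 + self.eps)   # one global step size
        for p in self.params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-scale)
```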
arXiv Detail & Related papers (2020-06-30T05:44:19Z)
- Generative Data Augmentation for Commonsense Reasoning [75.26876609249197]
G-DAUG^C is a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting.
G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation.
Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
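The selection step can be pictured as a greedy diversity heuristic, keeping generated examples that add the most new tokens to the pool; this is a simplified stand-in for the paper's influence- and diversity-based selection:

```python
def select_diverse(candidates: list[str], k: int) -> list[str]:
    # Greedily pick the candidate contributing the most unseen unigrams.
    pool = list(candidates)
    chosen: list[str] = []
    seen: set[str] = set()
    while pool and len(chosen) < k:
        best = max(pool, key=lambda c: len(set(c.split()) - seen))
        pool.remove(best)
        chosen.append(best)
        seen |= set(best.split())
    return chosen
```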
arXiv Detail & Related papers (2020-04-24T06:12:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.