Read the Docs Before Rewriting: Equip Rewriter with Domain Knowledge via Continual Pre-training
- URL: http://arxiv.org/abs/2507.00477v1
- Date: Tue, 01 Jul 2025 06:51:00 GMT
- Title: Read the Docs Before Rewriting: Equip Rewriter with Domain Knowledge via Continual Pre-training
- Authors: Qi Wang, Yixuan Cao, Yifan Liu, Jiangtao Zhao, Ping Luo
- Abstract summary: A RAG-based question-answering system retrieves documents based on user queries. In specialized domains, the rewriter model may struggle due to limited domain-specific knowledge. We propose the R&R (Read the doc before Rewriting) rewriter, which involves continual pre-training on professional documents.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A Retrieval-Augmented Generation (RAG)-based question-answering (QA) system enhances a large language model's knowledge by retrieving relevant documents based on user queries. Discrepancies between user queries and document phrasings often necessitate query rewriting. However, in specialized domains, the rewriter model may struggle due to limited domain-specific knowledge. To resolve this, we propose the R&R (Read the doc before Rewriting) rewriter, which involves continual pre-training on professional documents, akin to how students prepare for open-book exams by reviewing textbooks. Additionally, it can be combined with supervised fine-tuning for improved results. Experiments on multiple datasets demonstrate that R&R excels in professional QA across multiple domains, effectively bridging the query-document gap, while maintaining good performance in general scenarios, thus advancing the application of RAG-based QA systems in specialized fields.
Related papers
- Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation [49.29180578078616]
The Concept Coverage-based Query set Generation (CCQGen) framework is designed to generate a set of queries with comprehensive coverage of the document's concepts. We identify concepts not sufficiently covered by previous queries, and leverage them as conditions for subsequent query generation. This approach guides each new query to complement the previous ones, aiding in a thorough understanding of the document.
arXiv Detail & Related papers (2025-02-16T15:59:50Z)
- QuIM-RAG: Advancing Retrieval-Augmented Generation with Inverted Question Matching for Enhanced QA Performance [1.433758865948252]
This work presents a novel architecture for building Retrieval-Augmented Generation (RAG) systems. The RAG architecture is constructed to generate responses from the target document. We introduce QuIM-RAG, a novel approach for the retrieval mechanism in our system.
arXiv Detail & Related papers (2025-01-06T01:07:59Z)
- DMQR-RAG: Diverse Multi-Query Rewriting for RAG [26.518517678671376]
Large language models often encounter challenges with static knowledge and hallucinations, which undermine their reliability.
We introduce DMQR-RAG, a Diverse Multi-Query Rewriting framework to improve the performance of both document retrieval and final responses in RAG.
arXiv Detail & Related papers (2024-11-20T09:43:30Z)
- Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers [66.55612528039894]
AdaQR is a framework for training query rewriting models with limited rewrite annotations from seed datasets and no passage labels at all.
A novel approach is proposed to assess the retriever's preference for these candidates by the probability of answers conditioned on the conversational query.
arXiv Detail & Related papers (2024-06-16T16:09:05Z)
- DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering [4.364937306005719]
RAG has recently improved the performance of Large Language Models (LLMs) on knowledge-intensive tasks such as Question-Answering (QA).
We have found that even though there is low relevance between some critical documents and the query, it is possible to retrieve the remaining documents by combining parts of the documents with the query.
A two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and the accuracy of answers.
arXiv Detail & Related papers (2024-06-11T15:15:33Z)
- REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering [115.72130322143275]
REAR is a RElevance-Aware Retrieval-augmented approach for open-domain question answering (QA).
We develop a novel architecture for LLM-based RAG systems, by incorporating a specially designed assessment module.
Experiments on four open-domain QA tasks show that REAR significantly outperforms a number of previous competitive RAG approaches.
arXiv Detail & Related papers (2024-02-27T13:22:51Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Knowledge-Aided Open-Domain Question Answering [58.712857964048446]
We propose a knowledge-aided open-domain QA (KAQA) method which aims to improve relevant document retrieval and answer reranking.
During document retrieval, a candidate document is scored by considering its relationship to the question and other documents.
During answer reranking, a candidate answer is reranked using not only its own context but also the clues from other documents.
arXiv Detail & Related papers (2020-06-09T13:28:57Z)