Reformulating Domain Adaptation of Large Language Models as
Adapt-Retrieve-Revise
- URL: http://arxiv.org/abs/2310.03328v2
- Date: Thu, 12 Oct 2023 22:14:52 GMT
- Title: Reformulating Domain Adaptation of Large Language Models as
Adapt-Retrieve-Revise
- Authors: Zhen wan, Yating Zhang, Yexiang Wang, Fei Cheng, Sadao Kurohashi
- Abstract summary: GPT-4 can generate content with hallucinations in specific domains such as Chinese law, hindering their application in these areas.
This paper introduces a simple and effective domain adaptation framework for GPT-4 by reformulating generation as an textbfadapt-retrieve-revise process.
In the zero-shot setting of four Chinese legal tasks, our method improves accuracy by 33.3% compared to the direct generation by GPT-4.
- Score: 34.4546877502907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) like GPT-4 have recently demonstrated
astonishing zero-shot capabilities in general domain tasks, they often generate
content with hallucinations in specific domains such as Chinese law, hindering
their application in these areas. This is typically due to the absence of
training data that encompasses such a specific domain, preventing GPT-4 from
acquiring in-domain knowledge. A pressing challenge is that it's not plausible
to continue training LLMs of such scale on in-domain data.
This paper introduces a simple and effective domain adaptation framework for
GPT-4 by reformulating generation as an \textbf{adapt-retrieve-revise} process.
The initial step is to \textbf{adapt} an affordable 7B LLM to the target domain
by continuing learning on in-domain data. When solving a task, we leverage the
adapted LLM to generate a draft answer given a task query. Then, the draft
answer will be used to \textbf{retrieve} supporting evidence candidates from an
external in-domain knowledge base. Finally, the draft answer and retrieved
evidence are concatenated into a whole prompt to let GPT-4 assess the evidence
and \textbf{revise} the draft answer to generate the final answer.
Our proposal combines the advantages of the efficiency of adapting a smaller
7B model with the evidence-assessing capability of GPT-4 and effectively
prevents GPT-4 from generating hallucinatory content. In the zero-shot setting
of four Chinese legal tasks, our method improves accuracy by 33.3\% compared to
the direct generation by GPT-4. When compared to two stronger retrieval-based
baselines, our method outperforms them by 15.4\% and 23.9\%. Our code will be
released
Related papers
- RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs [60.38044044203333]
Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG)
We propose a novel instruction fine-tuning framework RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG.
For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-0409, and ChatQA-1.5, an open-sourced model with the state-of-the-art performance on RAG benchmarks.
arXiv Detail & Related papers (2024-07-02T17:59:17Z) - From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation [10.009516150364371]
We evaluate the effectiveness of several key approaches for this task, including few-shot prompting, supervised finetuning, and reinforcement learning.
Our findings reveal a large performance gap between GPT-4 and the open source models when using prompt-based strategies.
Our best model, CALM (CEFR-Aligned Language Model), surpasses the performance of GPT-4 and other strategies, at only a fraction of the cost.
arXiv Detail & Related papers (2024-06-05T07:57:17Z) - Constrained C-Test Generation via Mixed-Integer Programming [55.28927994487036]
This work proposes a novel method to generate C-Tests; a form of cloze tests (a gap filling exercise) where only the last part of a word is turned into a gap.
In contrast to previous works that only consider varying the gap size or gap placement to achieve locally optimal solutions, we propose a mixed-integer programming (MIP) approach.
We publish our code, model, and collected data consisting of 32 English C-Tests with 20 gaps each (totaling 3,200 individual gap responses) under an open source license.
arXiv Detail & Related papers (2024-04-12T21:35:21Z) - GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z) - Prompt Engineering or Fine Tuning: An Empirical Assessment of Large
Language Models in Automated Software Engineering Tasks [8.223311621898983]
GPT-4 with conversational prompts showed drastic improvement compared to GPT-4 with automatic prompting strategies.
fully automated prompt engineering with no human in the loop requires more study and improvement.
arXiv Detail & Related papers (2023-10-11T00:21:00Z) - Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance [11.595274304409937]
Large language models (LLMs) have revolutionized zero-shot task performance.
Current methods using trigger phrases such as "Let's think step by step" remain limited.
This study introduces PRomPTed, an approach that optimize the zero-shot prompts for individual task instances.
arXiv Detail & Related papers (2023-10-03T14:51:34Z) - Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with
Code-based Self-Verification [40.83776920225375]
OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets.
We propose a novel and effective prompting method, explicit ulinecode-based ulineself-ulineverification(CSV)
We achieve an impressive zero-shot accuracy on MATH dataset textbf(53.9% $to$ 84.3%).
arXiv Detail & Related papers (2023-08-15T17:58:45Z) - Generalized Planning in PDDL Domains with Pretrained Large Language
Models [82.24479434984426]
We consider PDDL domains and use GPT-4 to synthesize Python programs.
We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines.
arXiv Detail & Related papers (2023-05-18T14:48:20Z) - Can GPT-4 Perform Neural Architecture Search? [56.98363718371614]
We investigate the potential of GPT-4 to perform Neural Architecture Search (NAS)
Our proposed approach, textbfGPT-4 textbfEnhanced textbfNeural archtextbfItecttextbfUre textbfSearch (GENIUS)
We assess GENIUS across several benchmarks, comparing it with existing state-of-the-art NAS techniques to illustrate its effectiveness.
arXiv Detail & Related papers (2023-04-21T14:06:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.