Related papers: Establishing Performance Baselines in Fine-Tuning, Retrieval-Augmented Generation and Soft-Prompting for Non-Specialist LLM Users

URL: http://arxiv.org/abs/2311.05903v2
Date: Tue, 19 Mar 2024 10:32:16 GMT
Title: Establishing Performance Baselines in Fine-Tuning, Retrieval-Augmented Generation and Soft-Prompting for Non-Specialist LLM Users
Authors: Jennifer Dodgson, Lin Nanzheng, Julian Peh, Akira Rafhael Janson Pattirane, Alfath Daryl Alhajir, Eko Ridho Dinarto, Joseph Lim, Syed Danyal Ahmad
Abstract summary: In this paper we tested an unmodified version of GPT 3.5, a fine-tuned version, and the same unmodified model when given access to a vectorised RAG database. In each case we tested the model's ability to answer a set of 100 questions relating primarily to events that occurred after September 2021. We found that if commercial platforms are used and default settings are applied with no iteration in order to establish a baseline set of outputs, a fine-tuned model outperforms GPT 3.5 Turbo.
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Research into methods for improving the performance of large language models (LLMs) through fine-tuning, retrieval-augmented generation (RAG) and soft-prompting has tended to focus on the use of highly technical or high-cost techniques, making many of the newly discovered approaches comparatively inaccessible to non-technical users. In this paper we tested an unmodified version of GPT 3.5, a fine-tuned version, and the same unmodified model when given access to a vectorised RAG database, both in isolation and in combination with a basic, non-algorithmic soft prompt. In each case we tested the model's ability to answer a set of 100 questions relating primarily to events that occurred after September 2021 (the point at which GPT 3.5's training data set ends). We found that if commercial platforms are used and default settings are applied with no iteration in order to establish a baseline set of outputs, a fine-tuned model outperforms GPT 3.5 Turbo, while the RAG approach out-performed both. The application of a soft prompt significantly improved the performance of each approach.

Related papers

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs [60.38044044203333]
Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). We propose a novel instruction fine-tuning framework RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG. For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-0409, and ChatQA-1.5, an open-sourced model with state-of-the-art performance on RAG benchmarks.
arXiv Detail & Related papers (2024-07-02T17:59:17Z)
DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection [52.100335904875614]
We present a novel prompt tuning approach, namely, Decomposed Context Optimization (DeCoOp), which introduces new-class detectors and sub-classifiers to further enhance the base-class and new-class discriminability. Experimental results on 11 benchmark datasets validate the effectiveness of DePT and demonstrate that DeCoOp outperforms current state-of-the-art methods, providing a significant 2% average accuracy improvement.
arXiv Detail & Related papers (2024-06-01T07:46:42Z)
Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks [13.899853299593012]
We evaluate the zero-shot performance of foundational models on 3D VQA benchmarks. We find that GPT-based agents without any fine-tuning perform on par with the closed vocabulary approaches.
arXiv Detail & Related papers (2024-05-29T07:20:28Z)
Parameter-Efficient Fine-Tuning With Adapters [5.948206235442328]
This research introduces a novel adaptation method utilizing the UniPELT framework as a base. Our method employs adapters, which enable efficient transfer of pretrained models to new tasks with minimal retraining of the base model parameters.
arXiv Detail & Related papers (2024-05-09T01:40:38Z)
APrompt4EM: Augmented Prompt Tuning for Generalized Entity Matching [5.92432068962337]
Generalized Entity Matching (GEM) aims at judging whether two records represented in different formats refer to the same real-world entity. This paper introduces an augmented prompt tuning framework for the challenges, which consists of two main improvements.
arXiv Detail & Related papers (2024-05-08T05:38:56Z)
Controllable Prompt Tuning For Balancing Group Distributional Robustness [53.336515056479705]
We introduce an optimization scheme to achieve good performance across groups and find a good solution for all without severely sacrificing performance on any of them. We propose Controllable Prompt Tuning (CPT), which couples our approach with prompt-tuning techniques. On spurious correlation benchmarks, our procedures achieve state-of-the-art results across both transformer and non-transformer architectures, as well as unimodal and multimodal data.
arXiv Detail & Related papers (2024-03-05T06:23:55Z)
Enhancing Large Language Models for Text-to-Testcase Generation [12.864685900686158]
We introduce a text-to-testcase generation approach based on a large language model (GPT-3.5). We evaluate the effectiveness of our approach using a span of five large-scale open-source software projects.
arXiv Detail & Related papers (2024-02-19T07:50:54Z)
Efficient Classification of Student Help Requests in Programming Courses Using Large Language Models [2.5949084781328744]
This study evaluates the performance of the GPT-3.5 and GPT-4 models for classifying help requests from students in an introductory programming class. Fine-tuning the GPT-3.5 model improved its performance to such an extent that it approximated the accuracy and consistency across categories observed between two human raters.
arXiv Detail & Related papers (2023-10-31T00:56:33Z)
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting [65.00288634420812]
Pairwise Ranking Prompting (PRP) is a technique to significantly reduce the burden on Large Language Models (LLMs). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs.
arXiv Detail & Related papers (2023-06-30T11:32:25Z)
Train/Test-Time Adaptation with Retrieval [129.8579208970529]
We introduce Train/Test-Time Adaptation with Retrieval (T3AR), a method to adapt models both at train and test time. T3AR adapts a given model to the downstream task using refined pseudo-labels and a self-supervised contrastive objective function. Thanks to the retrieval module, our method gives the user or service provider the possibility to improve model adaptation on the downstream task.
arXiv Detail & Related papers (2023-03-25T02:44:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.