Generation, Distillation and Evaluation of Motivational
Interviewing-Style Reflections with a Foundational Language Model
- URL: http://arxiv.org/abs/2402.01051v1
- Date: Thu, 1 Feb 2024 22:54:31 GMT
- Authors: Andrew Brown, Jiading Zhu, Mohamed Abdelwahab, Alec Dong, Cindy Wang,
Jonathan Rose
- Abstract summary: We present a method for distilling the generation of reflections from a Foundational Language Model into smaller models.
We first show that GPT-4, using zero-shot prompting, can generate reflections at near 100% success rate.
We also show that GPT-4 can help in the labor-intensive task of evaluating the quality of the distilled models.
- Score: 2.33956825429387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Foundational Language Models are capable of performing many tasks at a
high level but are difficult to deploy in many applications because of their
size and proprietary ownership. Many will be motivated to distill specific
capabilities of foundational models into smaller models that can be owned and
controlled. In the development of a therapeutic chatbot, we wish to distill a
capability known as reflective listening, in which a therapist produces
reflections of client speech. These reflections either restate what a client
has said, or connect what was said to a relevant observation, idea or guess
that encourages and guides the client to continue contemplation. In this paper,
we present a method for distilling the generation of reflections from a
Foundational Language Model (GPT-4) into smaller models. We first show that
GPT-4, using zero-shot prompting, can generate reflections at near 100% success
rate, superior to all previous methods. Using reflections generated by GPT-4,
we fine-tune different sizes of the GPT-2 family. The GPT-2-small model
achieves 83% success on a hold-out test set and the GPT-2 XL achieves 90%
success. We also show that GPT-4 can help in the labor-intensive task of
evaluating the quality of the distilled models, using it as a zero-shot
classifier. Using triple-human review as a guide, the classifier achieves a
Cohen's kappa of 0.66, a substantial inter-rater reliability figure.
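The Cohen's kappa figure reported above measures chance-corrected agreement between the GPT-4 classifier and human review. A minimal sketch of the computation, using made-up success/failure labels purely for illustration:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels (1 = acceptable reflection) for ten generated reflections.
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
print(round(cohen_kappa(human, model), 2))  # 0.78 on this toy data
```

A kappa of 0.66, as reported in the abstract, falls in the range conventionally read as "substantial" agreement.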
Related papers
- Teaching Language Models to Self-Improve by Learning from Language Feedback [40.649677201161744]
We present Self-Refinement Tuning (SRT), a method that leverages model feedback for alignment.
SRT uses a base language model (e.g., Tulu2) to generate initial responses, which are critiqued and refined by a more advanced model.
SRT further optimizes the model by learning from its self-generated feedback and refinements, creating a feedback loop that promotes model improvement.
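The generate-critique-refine loop described in this summary might be sketched as follows; the generator, critic, and refiner below are toy stand-in functions, not the models used in the paper:

```python
def refinement_round(prompt, generate, critique, refine):
    """One SRT-style round: draft a response, critique it, then refine it."""
    draft = generate(prompt)
    feedback = critique(prompt, draft)
    improved = refine(prompt, draft, feedback)
    # The (prompt, improved, feedback) triple can feed the next
    # fine-tuning iteration, closing the self-improvement loop.
    return improved, feedback

# Toy stand-ins so the loop is runnable.
generate = lambda p: p.upper()
critique = lambda p, d: "too loud" if d.isupper() else "ok"
refine   = lambda p, d, f: d.lower() if f == "too loud" else d

out, fb = refinement_round("hello", generate, critique, refine)
print(out, fb)  # hello too loud
```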
arXiv Detail & Related papers (2024-06-11T11:20:05Z) - On Zero-Shot Counterspeech Generation by LLMs [23.39818166945086]
We present a comprehensive analysis of the performances of four Large Language Models (LLM) in zero-shot settings for counterspeech generation.
Considering the type of model, GPT-2 and FlanT5 are significantly better in terms of counterspeech quality.
ChatGPT is much better at generating counterspeech than the other models across all metrics.
arXiv Detail & Related papers (2024-03-22T04:13:10Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - NERIF: GPT-4V for Automatic Scoring of Drawn Models [0.6278186810520364]
Recently released GPT-4V provides a unique opportunity to advance scientific modeling practices.
We developed a method employing instructional notes and rubrics to prompt GPT-4V to score students' drawn models.
GPT-4V scores were compared with human experts' scores to calculate scoring accuracy.
arXiv Detail & Related papers (2023-11-21T20:52:04Z) - Assessing the efficacy of large language models in generating accurate
teacher responses [0.5774786149181391]
This study attempts to assess the generative abilities of large language models in providing informative and helpful insights to students.
We present an extensive evaluation of several benchmarking generative models, including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and fine-tuned DialoGPT.
Our experimental findings on the Teacher-Student Chatroom subset indicate the efficacy of GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT.
arXiv Detail & Related papers (2023-07-09T22:32:46Z) - The False Promise of Imitating Proprietary LLMs [158.65692029352584]
An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model.
This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model.
We first finetune a series of LMs that imitate ChatGPT using varying base model sizes.
We then evaluate the models using crowd raters and canonical NLP benchmarks.
arXiv Detail & Related papers (2023-05-25T05:00:12Z) - RL4F: Generating Natural Language Feedback with Reinforcement Learning
for Repairing Model Outputs [27.777809444120827]
Previous work proposed providing language models with natural language feedback to guide them in repairing their outputs.
We introduce RL4F, a multi-agent collaborative framework in which a critique generator is trained to maximize the end-task performance of GPT-3.
We show relative improvements up to 10% in multiple text similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.
arXiv Detail & Related papers (2023-05-15T17:57:16Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z) - Elaboration-Generating Commonsense Question Answering at Scale [77.96137534751445]
In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge.
We finetune smaller language models to generate useful intermediate context, referred to here as elaborations.
Our framework alternates between updating two language models -- an elaboration generator and an answer predictor -- allowing each to influence the other.
arXiv Detail & Related papers (2022-09-02T18:32:09Z) - Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z) - Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models.
Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity.
The performance gains are particularly important on large language models, such as GPT-3, where tuning models or prompts on large datasets is not feasible.
arXiv Detail & Related papers (2021-09-16T09:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.