She had Cobalt Blue Eyes: Prompt Testing to Create Aligned and
Sustainable Language Models
- URL: http://arxiv.org/abs/2310.18333v3
- Date: Fri, 15 Dec 2023 15:45:51 GMT
- Title: She had Cobalt Blue Eyes: Prompt Testing to Create Aligned and
Sustainable Language Models
- Authors: Veronica Chatrath, Oluwanifemi Bamgbose, Shaina Raza
- Abstract summary: Recent events indicate ethical concerns around conventionally trained large language models (LLMs).
We introduce a test suite of prompts to foster the development of aligned LLMs that are fair, safe, and robust.
Our test suite evaluates outputs from four state-of-the-art language models: GPT-3.5, GPT-4, OPT, and LLaMA-2.
- Score: 2.6089354079273512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the use of large language models (LLMs) increases within society, so does
the risk of their misuse. Appropriate safeguards must be in place to ensure LLM
outputs uphold the ethical standards of society, highlighting the positive role
that artificial intelligence technologies can have. Recent events indicate
ethical concerns around conventionally trained LLMs, leading to overall unsafe
user experiences. This motivates our research question: how do we ensure LLM
alignment? In this work, we introduce a test suite of unique prompts to foster
the development of aligned LLMs that are fair, safe, and robust. We show that
prompting LLMs at every step of the development pipeline, including data
curation, pre-training, and fine-tuning, will result in an overall more
responsible model. Our test suite evaluates outputs from four state-of-the-art
language models: GPT-3.5, GPT-4, OPT, and LLaMA-2. The assessment presented in
this paper highlights a gap between societal alignment and the capabilities of
current LLMs. Additionally, implementing a test suite such as ours lowers the
environmental overhead of making models safe and fair.
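To make the idea of a prompt-based alignment check concrete, below is a minimal sketch of how such a test suite might be driven: a fixed set of probe prompts is sent to each model and the responses are scanned for phrases that warrant manual review. The prompts, flag terms, and model wrapper are illustrative assumptions, not the suite released with this paper.

```python
# Minimal sketch of a prompt-based alignment test harness.
# The prompts, flag terms, and model callable are illustrative
# placeholders, not the test suite released with the paper.
from typing import Callable, Dict, List

PROBE_PROMPTS: List[str] = [
    "Describe a typical nurse and a typical engineer.",
    "Write a short story about a hiring decision.",
    "Explain how to keep an online community safe.",
]

# Phrases whose presence in an output marks it for manual review.
FLAG_TERMS = {"always", "never", "naturally better", "obviously inferior"}

def evaluate_model(name: str, generate: Callable[[str], str]) -> Dict[str, List[str]]:
    """Run every probe prompt through `generate` and collect flagged outputs."""
    flagged: Dict[str, List[str]] = {}
    for prompt in PROBE_PROMPTS:
        output = generate(prompt).lower()
        hits = [term for term in FLAG_TERMS if term in output]
        if hits:
            flagged[prompt] = hits
    print(f"{name}: {len(flagged)}/{len(PROBE_PROMPTS)} prompts flagged")
    return flagged

if __name__ == "__main__":
    # Stand-in for a real API call to GPT-3.5, GPT-4, OPT, or LLaMA-2.
    def dummy_model(prompt: str) -> str:
        return "Nurses are always women; engineers are naturally better at math."

    evaluate_model("dummy-model", dummy_model)
```

In practice each `generate` callable would wrap an API client for one of the evaluated models, and the flagging logic would be far richer than a keyword scan.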
Related papers
- SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types [21.683010095703832]
We develop a novel benchmark to assess the generalization of large language model (LLM) safety across various tasks and prompt types.
This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety.
Our assessment reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment.
arXiv Detail & Related papers (2024-10-29T11:47:01Z)
- LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites [6.796136787585992]
Large Language Models (LLMs) are evolving rapidly and have significantly reshaped the landscape of software development.
This paper explores the idea of judging tests used to evaluate compiler implementations of directive-based programming models.
arXiv Detail & Related papers (2024-08-21T15:54:17Z)
- Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing [39.93490432227601]
Large Language Models (LLMs) have achieved significant breakthroughs, but their generated unethical content poses potential risks.
Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment.
We propose GETA, a novel generative evolving testing approach that dynamically probes the underlying moral baselines of LLMs.
arXiv Detail & Related papers (2024-06-20T11:51:00Z)
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Underlying this practice is the belief that base models, which lack alignment tuning, carry little misuse risk; our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
- $\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks [21.12437562185667]
This paper presents a new approach for scaling LLM assessment in translating formal syntax to natural language.
We use context-free grammars (CFGs) to generate out-of-distribution datasets on the fly.
We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm.
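As a rough illustration of the CFG idea only, the sketch below samples fresh strings from a tiny context-free grammar by recursively expanding nonterminals, which is one simple way to produce out-of-distribution test inputs on the fly; the grammar, symbols, and depth limit are assumptions, not the paper's actual setup.

```python
# Minimal sketch of sampling strings from a context-free grammar.
# The toy arithmetic grammar below is an assumption, not the paper's.
import random

GRAMMAR = {
    "EXPR": [["TERM"], ["TERM", "+", "EXPR"], ["TERM", "*", "EXPR"]],
    "TERM": [["NUM"], ["(", "EXPR", ")"]],
    "NUM": [["0"], ["1"], ["2"], ["3"]],
}

def sample(symbol: str = "EXPR", depth: int = 0, max_depth: int = 6) -> str:
    """Recursively expand `symbol`; fall back to the shortest rule at max depth."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        rules = [rules[0]]  # shortest production, guarantees termination
    production = random.choice(rules)
    return " ".join(sample(s, depth + 1, max_depth) for s in production)

if __name__ == "__main__":
    for _ in range(5):
        print(sample())  # e.g. "( 2 + 1 ) * 3"
```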
arXiv Detail & Related papers (2024-03-27T08:08:00Z)
- TroubleLLM: Align to Red Team Expert [36.05032354083237]
Large Language Models (LLMs) can cause harm by manifesting undesirable safety issues.
We propose TroubleLLM, the first LLM designed to generate controllable test prompts on safety issues.
arXiv Detail & Related papers (2024-02-28T03:40:46Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
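For intuition about what a simple sequence-level uncertainty score can look like, here is a minimal sketch of mean token entropy computed from synthetic logits with NumPy; this is a generic UE baseline, not LM-Polygraph's actual interface.

```python
# Minimal sketch of mean-token-entropy uncertainty scoring.
# Synthetic logits only; this is not the LM-Polygraph interface.
import numpy as np

def mean_token_entropy(logits: np.ndarray) -> float:
    """logits: (num_generated_tokens, vocab_size). Higher value = more uncertain."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    token_entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(token_entropy.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    peaked = rng.normal(size=(10, 50)) * 5.0   # confident, peaked distributions
    flat = rng.normal(size=(10, 50)) * 0.1     # uncertain, near-uniform distributions
    print("peaked:", round(mean_token_entropy(peaked), 3))
    print("flat:  ", round(mean_token_entropy(flat), 3))
```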
arXiv Detail & Related papers (2023-11-13T15:08:59Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers that achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
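As a hedged sketch of the classifier idea, the snippet below fine-tunes a BERT-style sequence classifier to label prompts as answerable or should-not-answer using Hugging Face Transformers; the example prompts, labels, and training loop are invented placeholders, not the released dataset or models.

```python
# Minimal sketch of fine-tuning a BERT-style safety classifier.
# The prompts and labels are invented placeholders, not the Do-Not-Answer data.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

texts = [
    "How do I reset my router password?",       # answerable
    "Explain how to pick a neighbour's lock.",  # should not be answered
]
labels = torch.tensor([0, 1])  # 0 = answerable, 1 = should-not-answer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
dataset = list(zip(enc["input_ids"], enc["attention_mask"], labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a real run would use thousands of labelled prompts
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {out.loss.item():.4f}")
```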