ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
- URL: http://arxiv.org/abs/2405.14125v3
- Date: Thu, 07 Nov 2024 12:07:07 GMT
- Title: ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
- Authors: Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua
- Abstract summary: Large Language Models (LLMs) can elicit unintended and even harmful content when misaligned with human values.
Current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values.
We propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments.
- Score: 48.54271457765236
- Abstract: Large Language Models (LLMs) can elicit unintended and even harmful content when misaligned with human values, posing severe risks to users and society. To mitigate these risks, current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values. However, the labor-intensive nature of these benchmarks limits their test scope, hindering their ability to generalize to the extensive variety of open-world use cases and identify rare but crucial long-tail risks. Additionally, these static tests fail to adapt to the rapid evolution of LLMs, making it hard to evaluate timely alignment issues. To address these challenges, we propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments. ALI-Agent operates through two principal stages: Emulation and Refinement. During the Emulation stage, ALI-Agent automates the generation of realistic test scenarios. In the Refinement stage, it iteratively refines the scenarios to probe long-tail risks. Specifically, ALI-Agent incorporates a memory module to guide test scenario generation, a tool-using module to reduce human labor in tasks such as evaluating feedback from target LLMs, and an action module to refine tests. Extensive experiments across three aspects of human values--stereotypes, morality, and legality--demonstrate that ALI-Agent, as a general evaluation framework, effectively identifies model misalignment. Systematic analysis also validates that the generated test scenarios represent meaningful use cases, as well as integrate enhanced measures to probe long-tail risks. Our code is available at https://github.com/SophieZheng998/ALI-Agent.git
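As a rough illustration of the Emulation-Refinement pipeline described in the abstract, the Python sketch below shows how a core LLM agent could draw past scenarios from a memory module, generate a realistic test, judge the target model's response with an LLM-based evaluator, and iteratively refine the test to probe long-tail risks. All names here (`Memory`, `evaluate`, `ali_agent`) and the prompts are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
# Hypothetical sketch of ALI-Agent's Emulation-Refinement loop.
# Names and prompts below are illustrative assumptions, not the
# authors' actual API; see the linked repository for the real code.

from dataclasses import dataclass, field

@dataclass
class Memory:
    """Memory module: stores scenarios that previously exposed misalignment."""
    records: list = field(default_factory=list)

    def retrieve(self, misconduct: str, k: int = 3) -> list:
        # A real implementation might use embedding similarity; recency here.
        return self.records[-k:]

    def add(self, scenario: str) -> None:
        self.records.append(scenario)

def evaluate(core_llm, target_response: str) -> bool:
    """Tool-using module: an LLM judge decides whether the target's
    response endorses the misconduct (True) or safely refuses (False)."""
    verdict = core_llm(
        "Does this response endorse the misconduct? Answer YES or NO.\n"
        f"{target_response}")
    return verdict.strip().upper().startswith("YES")

def ali_agent(core_llm, target_llm, misconduct: str,
              memory: Memory, max_refinements: int = 3) -> bool:
    # --- Emulation stage: generate a realistic test scenario ---
    examples = memory.retrieve(misconduct)
    scenario = core_llm(
        "Wrap the misconduct below in a realistic scenario.\n"
        f"Misconduct: {misconduct}\nPast effective scenarios: {examples}")
    # --- Refinement stage: iteratively probe long-tail risks ---
    for _ in range(max_refinements):
        response = target_llm(scenario)
        if evaluate(core_llm, response):
            memory.add(scenario)       # record the successful probe
            return True                # misalignment identified
        scenario = core_llm(           # action module: refine the test
            f"The target refused:\n{response}\n"
            f"Rewrite the scenario to make the risk more implicit:\n{scenario}")
    return False
```

Here `core_llm` and `target_llm` are assumed to be plain `str -> str` callables wrapping the evaluating agent and the model under test.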
Related papers
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - Generative LLM Powered Conversational AI Application for Personalized Risk Assessment: A Case Study in COVID-19 [6.367429891237191]
Large language models (LLMs) have shown remarkable capabilities in various natural language tasks.
This work demonstrates a new LLM-powered disease risk assessment approach via streaming human-AI conversation.
arXiv Detail & Related papers (2024-09-23T13:55:13Z) - Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing [39.93490432227601]
Large Language Models (LLMs) have achieved significant breakthroughs, but their generated unethical content poses potential risks.
Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment.
We propose GETA, a novel generative evolving testing approach that dynamically probes the underlying moral baselines of LLMs.
arXiv Detail & Related papers (2024-06-20T11:51:00Z) - FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom [19.104850413126066]
Federated Learning (FL) has emerged as a promising solution for the collaborative training of large language models (LLMs).
Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers.
We propose FedEval-LLM that provides reliable performance measurements of LLMs on downstream tasks without the reliance on labeled test sets and external tools.
arXiv Detail & Related papers (2024-04-18T15:46:26Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Contrary to the assumption that base LLMs' limited instruction-following ability safeguards against misuse, our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains [15.320563604087246]
High-risk domains pose unique challenges that require language models to provide accurate and safe responses.
Despite the great success of large language models (LLMs), their performance in high-risk domains remains unclear.
arXiv Detail & Related papers (2023-11-25T08:58:07Z) - Identifying the Risks of LM Agents with an LM-Emulated Sandbox [68.26587052548287]
Language Model (LM) agents and tools enable a rich set of capabilities but also amplify potential risks.
The high cost of testing these agents makes it increasingly difficult to identify high-stakes, long-tail risks.
We introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios.
arXiv Detail & Related papers (2023-09-25T17:08:02Z) - CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility [62.74405775089802]
We present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs.
We manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains.
Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility.
arXiv Detail & Related papers (2023-07-19T01:22:40Z)