METAL: Metamorphic Testing Framework for Analyzing Large-Language Model
Qualities
- URL: http://arxiv.org/abs/2312.06056v1
- Date: Mon, 11 Dec 2023 01:29:19 GMT
- Title: METAL: Metamorphic Testing Framework for Analyzing Large-Language Model
Qualities
- Authors: Sangwon Hyun, Mingyu Guo, M. Ali Babar
- Abstract summary: Large-Language Models (LLMs) have shifted the paradigm of natural language data processing.
Recent studies have tested Quality Attributes (QAs) of LLMs by generating adversarial input texts.
We propose a MEtamorphic Testing for Analyzing LLMs (METAL) framework to address these issues.
- Score: 4.493507573183107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-Language Models (LLMs) have shifted the paradigm of natural language
data processing. However, their black-box and probabilistic characteristics
can lead to potential risks in the quality of outputs across diverse LLM
applications. Recent studies have tested Quality Attributes (QAs), such as
robustness or fairness, of LLMs by generating adversarial input texts. However,
existing studies have limited their coverage of QAs and tasks in LLMs and are
difficult to extend. Additionally, these studies have only used one evaluation
metric, Attack Success Rate (ASR), to assess the effectiveness of their
approaches. We propose a MEtamorphic Testing for Analyzing LLMs (METAL)
framework to address these issues by applying Metamorphic Testing (MT)
techniques. This approach facilitates the systematic testing of LLM qualities
by defining Metamorphic Relations (MRs), which serve as modularized evaluation
metrics. The METAL framework can automatically generate hundreds of MRs from
templates that cover various QAs and tasks. In addition, we introduce novel
metrics that integrate the ASR method with the semantic qualities of text to
assess the effectiveness of MRs accurately. Through the experiments conducted
with three prominent LLMs, we have confirmed that the METAL framework
effectively evaluates essential QAs on primary LLM tasks and reveals the
quality risks in LLMs. Moreover, the newly proposed metrics can guide the
selection of optimal MRs for testing each task and suggest the most effective
method for generating MRs.
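
As a concrete illustration of the metamorphic-testing idea in the abstract, the sketch below expresses a Metamorphic Relation (MR) as an input perturbation plus an expected relation on outputs, and combines a plain Attack Success Rate (ASR) with a semantics-aware variant of it. This is a minimal sketch under stated assumptions, not the METAL framework's actual API: the names (MetamorphicRelation, evaluate_mr, token_overlap, the toy model) are hypothetical placeholders, and token overlap stands in for a real semantic-similarity measure.

```python
# Minimal sketch of MR-based metamorphic testing for an LLM (hypothetical API, not METAL's).
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MetamorphicRelation:
    """An MR = a perturbation on the source input plus an expected relation on outputs."""
    name: str
    perturb: Callable[[str], str]       # builds the follow-up input from the source input
    holds: Callable[[str, str], bool]   # relation expected between source and follow-up outputs


def token_overlap(a: str, b: str) -> float:
    """Crude stand-in for a semantic-similarity score in [0, 1] (Jaccard over tokens)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def evaluate_mr(model: Callable[[str], str],
                mr: MetamorphicRelation,
                inputs: List[str]) -> Dict[str, float]:
    """Run one MR over a set of inputs; report ASR and a semantics-weighted variant."""
    violations, weighted = 0, 0.0
    for source in inputs:
        follow_up = mr.perturb(source)
        out_src, out_fu = model(source), model(follow_up)
        if not mr.holds(out_src, out_fu):
            violations += 1
            # Weight each violation by how far the two outputs drift semantically,
            # echoing the idea of folding semantic text quality into an ASR-style metric.
            weighted += 1.0 - token_overlap(out_src, out_fu)
    n = len(inputs)
    return {"attack_success_rate": violations / n,
            "semantic_weighted_rate": weighted / n}


if __name__ == "__main__":
    # Robustness MR for a sentiment task: a small typo in the input should not
    # change the predicted label. In a template-based setup, many such MRs would
    # be instantiated by varying the perturbation and the expected relation.
    typo_mr = MetamorphicRelation(
        name="typo-robustness",
        perturb=lambda s: s.replace("e", "3", 1),  # crude character-level noise
        holds=lambda y1, y2: y1.strip().lower() == y2.strip().lower(),
    )
    toy_model = lambda prompt: "positive" if "great" in prompt else "negative"
    print(evaluate_mr(toy_model, typo_mr, ["The movie was great", "Terrible service"]))
```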
Related papers
- Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse.
Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration.
To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z)
- Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs, and that up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z)
- Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
Membership inference attacks (MIAs) aim to determine whether a specific instance was part of a target model's training data.
Applying MIAs to large language models (LLMs) presents unique challenges due to the massive scale of pre-training data and the ambiguous nature of membership.
We introduce EM-MIA, a novel MIA method for LLMs that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm.
arXiv Detail & Related papers (2024-10-10T03:31:16Z)
- MILE: A Mutation Testing Framework of In-Context Learning Systems [5.419884861365132]
We propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems.
First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets.
With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites.
arXiv Detail & Related papers (2024-09-07T13:51:42Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding the decoding process of LLMs with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Evaluation and Improvement of Fault Detection for Large Language Models [30.760472387136954]
This paper investigates the effectiveness of existing fault detection methods for large language models (LLMs).
We propose MuCS, a prompt Mutation-based prediction Confidence Smoothing framework, to boost the fault detection capability of existing methods.
arXiv Detail & Related papers (2024-04-14T07:06:12Z)
- Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models [29.202758753639078]
This study investigates the limitations of Multiple Choice Question Answering (MCQA) as an evaluation method for Large Language Models (LLMs).
We propose a dataset augmenting method for Multiple-Choice Questions (MCQs), MCQA+, that can more accurately reflect the performance of the model.
arXiv Detail & Related papers (2024-02-02T12:07:00Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm for MLLMs, which evaluates MLLMs with per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)