RITFIS: Robust input testing framework for LLMs-based intelligent
software
- URL: http://arxiv.org/abs/2402.13518v1
- Date: Wed, 21 Feb 2024 04:00:54 GMT
- Title: RITFIS: Robust input testing framework for LLMs-based intelligent
software
- Authors: Mingxuan Xiao, Yan Xiao, Hai Dong, Shunhui Ji and Pengcheng Zhang
- Abstract summary: RITFIS is the first framework designed to assess the robustness of LLM-based intelligent software against natural language inputs.
RITFIS adapts 17 automated testing methods, originally designed for Deep Neural Network (DNN)-based intelligent software, to the LLM-based testing scenario.
Empirical validation demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software.
- Score: 6.439196068684973
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The dependence of Natural Language Processing (NLP) intelligent software on
Large Language Models (LLMs) is increasingly prominent, underscoring the
necessity for robustness testing. Current testing methods focus solely on the
robustness of LLM-based software to prompts. Given the complexity and diversity
of real-world inputs, studying the robustness of LLM-based software in handling
comprehensive inputs (including prompts and examples) is crucial for a thorough
understanding of its performance.
To this end, this paper introduces RITFIS, a Robust Input Testing Framework
for LLM-based Intelligent Software. To our knowledge, RITFIS is the first
framework designed to assess the robustness of LLM-based intelligent software
against natural language inputs. Given a threat model and a prompt, the framework
formulates the testing process as a combinatorial optimization problem: a goal
function determines whether a test case succeeds, perturbation operators create a
transformation space around the original examples, and a series of search methods
filters out the cases that satisfy both the testing objective and the language
constraints. With its modular design, RITFIS offers a comprehensive method for
evaluating the robustness of LLM-based intelligent software.
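To make this formulation concrete, the following is a minimal sketch of a goal-function-driven search over a perturbation space. It is not RITFIS's implementation: the `query_llm` callable, the `synonyms` dictionary, and the random single-word-substitution search are illustrative assumptions standing in for the threat model, transformation space, and search methods the framework actually modularizes.

```python
import random

def goal_function(original_label: str, perturbed_output: str) -> bool:
    # A test case "succeeds" when the perturbed input flips the LLM's answer.
    return perturbed_output.strip().lower() != original_label.strip().lower()

def transformation_space(tokens, synonyms):
    # Transformation space: every single-word synonym substitution of the example.
    for i, tok in enumerate(tokens):
        for syn in synonyms.get(tok.lower(), []):
            yield tokens[:i] + [syn] + tokens[i + 1:]

def search_test_case(prompt, example, label, query_llm, synonyms, max_edit_ratio=0.2):
    # Random search: commit one substitution at a time, stopping as soon as the
    # goal function is met or the edit-ratio (language) constraint would be broken.
    tokens = example.split()
    budget = max(1, int(max_edit_ratio * len(tokens)))
    for _ in range(budget):
        candidates = list(transformation_space(tokens, synonyms))
        if not candidates:
            return None
        tokens = random.choice(candidates)
        output = query_llm(prompt, " ".join(tokens))  # hypothetical LLM interface
        if goal_function(label, output):
            return " ".join(tokens)  # robustness-violating test case found
    return None
```

In practice the random choice above would be replaced by stronger search strategies (greedy, beam, or population-based search); that design space is what the adapted methods described next explore.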
RITFIS adapts 17 automated testing methods, originally designed for Deep
Neural Network (DNN)-based intelligent software, to the LLM-based software
testing scenario. Empirical validation demonstrates the effectiveness of RITFIS in
evaluating LLM-based intelligent software. However, existing methods generally have
limitations, especially when dealing with lengthy texts and structurally complex
threat models. We therefore conducted a comprehensive analysis based on five metrics
and provided testing-method optimization strategies that benefit both researchers
and everyday users.
Related papers
- Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models [76.17975723711886]
Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs).
In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation.
Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores (a minimal sketch of this recipe appears after this list).
arXiv Detail & Related papers (2025-02-20T10:25:13Z)
- Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving [55.895917967408586]
Existing approaches to mathematical reasoning with large language models rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation.
We propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously.
arXiv Detail & Related papers (2025-02-17T16:56:23Z)
- A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability [0.8287206589886879]
We propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach to evaluate large language models (LLMs).
By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex.
Our findings indicate that GPT-4o performs better on complex program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations.
arXiv Detail & Related papers (2025-02-05T03:51:44Z)
- Automated Robustness Testing for LLM-based NLP Software [6.986328098563149]
There are no known automated robustness testing methods specifically designed for LLM-based NLP software.
Existing testing methods can be applied to LLM-based software through AORTA, but their effectiveness is limited.
We propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search.
arXiv Detail & Related papers (2024-12-30T15:33:34Z)
- On the Design and Analysis of LLM-Based Algorithms [74.7126776018275]
Large language models (LLMs) are used as sub-routines in algorithms.
LLMs have achieved remarkable empirical success.
Our proposed framework holds promise for advancing LLM-based algorithms.
arXiv Detail & Related papers (2024-07-20T07:39:07Z)
- LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic [2.1073328551105623]
We introduce LLM-ARC, a neuro-symbolic framework designed to enhance the logical reasoning capabilities of Large Language Models (LLMs).
LLM-ARC employs an Actor-Critic method where the LLM Actor generates declarative logic programs along with tests for semantic correctness, while the Automated Reasoning Critic evaluates the code, runs the tests, and provides feedback on test failures for iterative refinement.
Our experiments demonstrate significant improvements over LLM-only baselines, highlighting the importance of logic test generation and iterative self-refinement.
arXiv Detail & Related papers (2024-06-25T15:52:15Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.
We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- Enhancing LLM-based Test Generation for Hard-to-Cover Branches via Program Analysis [8.31978033489419]
We propose TELPA, a novel technique to generate tests that can reach hard-to-cover branches.
Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms the state-of-the-art SBST and LLM-based techniques.
arXiv Detail & Related papers (2024-04-07T14:08:28Z)
- A Case Study on Test Case Construction with Large Language Models: Unveiling Practical Insights and Challenges [2.7029792239733914]
This paper examines the application of Large Language Models in the construction of test cases within the context of software engineering.
Through a blend of qualitative and quantitative analyses, this study assesses the impact of LLMs on test case comprehensiveness, accuracy, and efficiency.
arXiv Detail & Related papers (2023-12-19T20:59:02Z)
- LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z)
- SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SatLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
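The token-level Mahalanobis-distance entry above spells out a concrete recipe: extract token embeddings from several LLM layers, score each token with MD, and train a linear regression on those per-layer scores. Below is a minimal sketch of that recipe under simplifying assumptions; the embedding arrays, the per-layer Gaussian fit on in-distribution reference tokens, and the supervision signal for the regression are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_gaussian(reference_embeddings):
    # reference_embeddings: (num_tokens, hidden_dim) array of in-distribution tokens
    # from one layer; fit a single Gaussian (mean and inverse covariance) to them.
    mean = reference_embeddings.mean(axis=0)
    cov = np.cov(reference_embeddings, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize for invertibility
    return mean, np.linalg.inv(cov)

def mahalanobis_scores(token_embeddings, mean, inv_cov):
    # One MD score per token: sqrt((x - mu)^T Sigma^{-1} (x - mu)).
    diff = token_embeddings - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

def train_uncertainty_head(per_layer_scores, target_uncertainty):
    # per_layer_scores: (num_tokens, num_layers) matrix of MD features;
    # target_uncertainty: assumed supervision signal, e.g. 1.0 for tokens
    # drawn from answers judged incorrect, 0.0 otherwise.
    reg = LinearRegression()
    reg.fit(per_layer_scores, target_uncertainty)
    return reg
```

At inference time, the per-layer MD scores of each generated token would be stacked into a feature vector and passed to reg.predict to obtain a token-level uncertainty estimate.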
This list is automatically generated from the titles and abstracts of the papers on this site.