RITFIS: Robust input testing framework for LLMs-based intelligent
software
- URL: http://arxiv.org/abs/2402.13518v1
- Date: Wed, 21 Feb 2024 04:00:54 GMT
- Title: RITFIS: Robust input testing framework for LLMs-based intelligent
software
- Authors: Mingxuan Xiao, Yan Xiao, Hai Dong, Shunhui Ji and Pengcheng Zhang
- Abstract summary: RITFIS is the first framework designed to assess the robustness of LLM-based intelligent software against natural language inputs.
RITFIS adapts 17 automated testing methods, originally designed for Deep Neural Network (DNN)-based intelligent software.
Empirical validation demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software.
- Score: 6.439196068684973
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The dependence of Natural Language Processing (NLP) intelligent software on
Large Language Models (LLMs) is increasingly prominent, underscoring the
necessity for robustness testing. Current testing methods focus solely on the
robustness of LLM-based software to prompts. Given the complexity and diversity
of real-world inputs, studying the robustness of LLM-based software in handling
comprehensive inputs (including prompts and examples) is crucial for a thorough
understanding of its performance.
To this end, this paper introduces RITFIS, a Robust Input Testing Framework
for LLM-based Intelligent Software. To our knowledge, RITFIS is the first
framework designed to assess the robustness of LLM-based intelligent software
against natural language inputs. This framework, based on given threat models
and prompts, primarily defines the testing process as a combinatorial
optimization problem: a goal function determines whether a test case succeeds,
perturbation operators create a transformation space around the original
examples, and a series of search methods filters candidates that satisfy both
the testing objectives and language constraints. With its modular design,
RITFIS offers a comprehensive method for evaluating the robustness of LLM-based
intelligent software.
RITFIS adapts 17 automated testing methods, originally designed for Deep
Neural Network (DNN)-based intelligent software, to the LLM-based software
testing scenario. Empirical validation demonstrates the effectiveness of RITFIS
in evaluating LLM-based intelligent software. However, existing
methods generally have limitations, especially when dealing with lengthy texts
and structurally complex threat models. Therefore, we conducted a comprehensive
analysis based on five metrics and provided insightful testing method
optimization strategies, benefiting both researchers and everyday users.
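To make the abstract's formulation concrete, here is a minimal Python sketch, not the RITFIS implementation: it treats robustness testing as a search over single-word perturbations of an example, keeping a candidate only if it satisfies a word-change constraint and flips the output of the software under test. Every name below (query_llm, synonym_candidates, the 0.2 change-ratio threshold) is a hypothetical placeholder.

# Minimal sketch (assumed, not the RITFIS code) of testing as combinatorial
# optimization: goal function + perturbation-induced transformation space +
# search under a language constraint.

def query_llm(prompt: str, example: str) -> str:
    """Placeholder for a call to the LLM-based software under test."""
    raise NotImplementedError

def goal_function(prompt: str, original: str, perturbed: str) -> bool:
    """A test case succeeds if the perturbed example changes the output."""
    return query_llm(prompt, perturbed) != query_llm(prompt, original)

def synonym_candidates(word: str) -> list[str]:
    """Placeholder perturbation operator, e.g. a synonym lookup."""
    return []

def constraint_ok(original: str, perturbed: str, max_change_ratio: float = 0.2) -> bool:
    """Language constraint: bound the fraction of modified words."""
    orig_words, pert_words = original.split(), perturbed.split()
    changed = sum(a != b for a, b in zip(orig_words, pert_words))
    return changed / max(len(orig_words), 1) <= max_change_ratio

def search(prompt: str, example: str) -> str | None:
    """Search the transformation space for a successful test case."""
    words = example.split()
    for i in range(len(words)):                    # one word position at a time
        for cand in synonym_candidates(words[i]):  # candidate perturbations
            trial = words.copy()
            trial[i] = cand
            perturbed = " ".join(trial)
            if constraint_ok(example, perturbed) and goal_function(prompt, example, perturbed):
                return perturbed                   # successful test case found
    return None                                    # search exhausted without success

In the framework described above, the goal function, perturbation operators, and search method are interchangeable components; the sketch fixes one simple choice for each slot only to illustrate how they interact.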
Related papers
- On the Design and Analysis of LLM-Based Algorithms [74.7126776018275]
Large language models (LLMs) are used as subroutines in algorithms.
LLMs have achieved remarkable empirical success.
Our framework holds promise for advancing LLM-based algorithms.
To promote further study of LLM-based algorithms, we release our source code at https://github.com/modelscope/agentscope/tree/main/examples/paper_llm_based_algorithm.
arXiv Detail & Related papers (2024-07-20T07:39:07Z) - LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic [2.1073328551105623]
We introduce LLM-ARC, a neuro-symbolic framework designed to enhance the logical reasoning capabilities of Large Language Models (LLMs).
LLM-ARC employs an Actor-Critic method where the LLM Actor generates declarative logic programs along with tests for semantic correctness, while the Automated Reasoning Critic evaluates the code, runs the tests, and provides feedback on test failures for iterative refinement.
Our experiments demonstrate significant improvements over LLM-only baselines, highlighting the importance of logic test generation and iterative self-refinement.
arXiv Detail & Related papers (2024-06-25T15:52:15Z) - BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained programming tasks.
Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.
arXiv Detail & Related papers (2024-06-22T15:52:04Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [85.51252685938564]
Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML).
As with other ML models, large language models (LLMs) are prone to make incorrect predictions, "hallucinate" by fabricating claims, or simply generate low-quality output for a given input.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines, and provides an environment for controllable and consistent evaluation of novel techniques.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding the decoding process of LLMs with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Enhancing LLM-based Test Generation for Hard-to-Cover Branches via Program Analysis [8.31978033489419]
We propose TELPA, a novel technique to generate tests that can reach hard-to-cover branches.
Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms the state-of-the-art SBST and LLM-based techniques.
arXiv Detail & Related papers (2024-04-07T14:08:28Z) - LLMs for Relational Reasoning: How Far are We? [8.840750655261251]
Large language models (LLMs) have revolutionized many areas by achieving state-of-the-art performance on downstream tasks.
Recent efforts have demonstrated that LLMs are poor at solving sequential decision-making problems.
arXiv Detail & Related papers (2024-01-17T08:22:52Z) - A Case Study on Test Case Construction with Large Language Models:
Unveiling Practical Insights and Challenges [2.7029792239733914]
This paper examines the application of Large Language Models in the construction of test cases within the context of software engineering.
Through a blend of qualitative and quantitative analyses, this study assesses the impact of LLMs on test case comprehensiveness, accuracy, and efficiency.
arXiv Detail & Related papers (2023-12-19T20:59:02Z) - LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z) - SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)