Knowledge-based Consistency Testing of Large Language Models
- URL: http://arxiv.org/abs/2407.12830v2
- Date: Sat, 5 Oct 2024 14:12:11 GMT
- Title: Knowledge-based Consistency Testing of Large Language Models
- Authors: Sai Sathiesh Rajan, Ezekiel Soremekun, Sudipta Chattopadhyay
- Abstract summary: We systematically expose and measure the inconsistency and knowledge gaps of Large Language Models (LLMs).
We propose an automated testing framework (called KonTest) which leverages a knowledge graph to construct test cases.
Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
- Score: 2.9699290794642366
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we systematically expose and measure the inconsistency and knowledge gaps of Large Language Models (LLMs). Specifically, we propose an automated testing framework (called KonTest) which leverages a knowledge graph to construct test cases. KonTest probes and measures the inconsistencies in the LLM's knowledge of the world via a combination of semantically-equivalent queries and test oracles (metamorphic or ontological oracle). KonTest further mitigates knowledge gaps via a weighted LLM model ensemble. Using four state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that KonTest generates 19.2% error-inducing inputs (1917 errors from 9979 test inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. A mitigation method informed by KonTest's test suite reduces the LLM knowledge gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
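The testing loop the abstract describes can be illustrated compactly: build semantically-equivalent queries from a knowledge-graph triple, then check the answers against each other (metamorphic oracle) and against the graph (ontological oracle). The sketch below is a minimal, hypothetical illustration in that spirit; `query_llm`, `normalize`, and the paraphrase templates are assumptions of this sketch, not KonTest's actual implementation.

```python
# Minimal sketch of knowledge-graph-driven consistency testing in the
# spirit of KonTest. `query_llm` is a hypothetical stand-in for any
# chat-completion call; the triple and templates are toy data.
from collections import Counter

def query_llm(prompt: str) -> str:
    """Placeholder: route the prompt to the LLM under test."""
    raise NotImplementedError

def normalize(answer: str) -> str:
    return answer.strip().lower().rstrip(".")

def consistency_test(triple, templates):
    """Ask semantically-equivalent questions built from one KG triple.

    Metamorphic oracle: all paraphrases should yield the same answer.
    Ontological oracle: that answer should match the KG object.
    """
    subject, relation, obj = triple
    answers = [normalize(query_llm(t.format(subject=subject)))
               for t in templates]
    majority, _ = Counter(answers).most_common(1)[0]
    inconsistent = len(set(answers)) > 1          # metamorphic failure
    knowledge_gap = majority != normalize(obj)    # ontological failure
    return {"answers": answers,
            "inconsistent": inconsistent,
            "knowledge_gap": knowledge_gap}

# Toy example: the triple ("Marie Curie", "bornIn", "Warsaw") with three
# semantically-equivalent phrasings of the same factual question.
templates = [
    "Where was {subject} born?",
    "In which city was {subject} born?",
    "{subject} was born in which city? Answer with the city name.",
]
# result = consistency_test(("Marie Curie", "bornIn", "Warsaw"), templates)
```

In this reading, disagreement among the paraphrased answers signals an inconsistency, while a consistent-but-wrong majority answer signals a knowledge gap; the paper's weighted-ensemble mitigation would then combine such answers across several models.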
Related papers
- Test smells in LLM-Generated Unit Tests [11.517293765116307]
This study explores the diffusion of test smells in LLM-generated unit test suites.
We analyze a benchmark of 20,500 LLM-generated test suites produced by four models across five prompt engineering techniques.
We identify and analyze the prevalence and co-occurrence of various test smells in both human-written and LLM-generated test suites.
arXiv Detail & Related papers (2024-10-14T15:35:44Z)
- On the Effectiveness of LLMs for Manual Test Verifications [1.920300814128832]
This study aims to explore the use of Large Language Models (LLMs) to produce verifications for manual tests.
Open-source models Mistral-7B and Phi-3-mini-4k demonstrated effectiveness and consistency comparable to closed-source models.
There were also concerns about AI hallucinations, where verifications significantly deviated from expectations.
arXiv Detail & Related papers (2024-09-19T02:03:04Z)
- Rethinking the Influence of Source Code on Test Case Generation [22.168699378889148]
Large language models (LLMs) have been widely applied to assist test generation with the source code under test provided as the context.
This paper aims to answer the question: If the source code under test is incorrect, will LLMs be misguided when generating tests?
Our evaluation results demonstrate that incorrect code can significantly mislead LLMs away from generating correct, high-coverage, and bug-revealing tests.
arXiv Detail & Related papers (2024-09-14T15:17:34Z)
- Improving LLM-based Unit test generation via Template-based Repair [8.22619177301814]
Unit testing is crucial for detecting bugs in individual program units but consumes substantial time and effort.
Large language models (LLMs) have demonstrated remarkable reasoning and generation capabilities.
In this paper, we propose TestART, a novel unit test generation method.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
- Constrained C-Test Generation via Mixed-Integer Programming [55.28927994487036]
This work proposes a novel method to generate C-Tests, a form of cloze test (a gap-filling exercise) in which only the last part of a word is turned into a gap.
In contrast to previous works that only consider varying the gap size or gap placement to achieve locally optimal solutions, we propose a mixed-integer programming (MIP) approach.
We publish our code, model, and collected data consisting of 32 English C-Tests with 20 gaps each (totaling 3,200 individual gap responses) under an open source license.
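To make the gap format concrete, here is a minimal sketch of classic C-Test gap construction, where the second half of every second word is blanked. All names are illustrative, and the paper's MIP model, which optimizes gap placement under difficulty constraints, is deliberately not reproduced here.

```python
# A minimal sketch of basic C-Test gap construction: in a classic C-Test,
# the second half of every second word is deleted. This shows only the
# gap format, not the paper's mixed-integer programming optimization.

def make_ctest(sentence: str, start: int = 1, step: int = 2):
    """Blank the last half of every `step`-th word, starting at `start`."""
    words = sentence.split()
    gapped, solutions = [], []
    for i, w in enumerate(words):
        if i >= start and (i - start) % step == 0 and len(w) > 3:
            keep = len(w) // 2                # keep the first half
            gapped.append(w[:keep] + "_" * (len(w) - keep))
            solutions.append(w)
        else:
            gapped.append(w)
    return " ".join(gapped), solutions

text, answers = make_ctest("Cloze tests ask learners to restore missing word endings")
# text    -> 'Cloze te___ ask lear____ to res____ missing wo__ endings'
# answers -> ['tests', 'learners', 'restore', 'word']
```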
arXiv Detail & Related papers (2024-04-12T21:35:21Z)
- KnowTuning: Knowledge-aware Fine-tuning for Large Language Models [83.5849717262019]
We propose a knowledge-aware fine-tuning (KnowTuning) method to improve fine-grained and coarse-grained knowledge awareness of LLMs.
KnowTuning generates more facts with less factual error rate under fine-grained facts evaluation.
arXiv Detail & Related papers (2024-02-17T02:54:32Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to the extensive knowledge they acquire from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- An empirical study of testing machine learning in the wild [35.13282520395855]
Machine and deep learning (ML/DL) algorithms have been increasingly adopted in many software systems.
Due to the inductive nature of these algorithms, ensuring the quality of such systems remains a significant challenge for the research community.
Recent research in ML/DL quality assurance has adapted concepts from traditional software testing, such as mutation testing, to improve reliability.
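The mutation-testing idea mentioned above can be made concrete with a short, hypothetical sketch: the weights of a trained model are perturbed, and the evaluation data is checked for whether it detects the change. The function names, noise model, and threshold below are illustrative assumptions of this sketch, not taken from the surveyed papers.

```python
# A minimal sketch of mutation testing adapted to an ML model: perturb
# ("mutate") the trained weights and check whether the evaluation data
# detects ("kills") the mutant via an accuracy drop. Names, the noise
# model, and the threshold are illustrative assumptions.
import numpy as np

def mutate_weights(model_weights, scale=0.05, rng=None):
    """Return a mutant: weights with small Gaussian noise added."""
    rng = rng or np.random.default_rng(0)
    return [w + rng.normal(0.0, scale, size=w.shape) for w in model_weights]

def mutant_killed(evaluate, original_weights, mutant_weights, tol=0.01):
    """A mutant is 'killed' if evaluation exposes the behavioral change."""
    return evaluate(original_weights) - evaluate(mutant_weights) > tol

# Mutation score = killed mutants / total mutants; a low score suggests
# the evaluation data is too weak to detect behavioral changes.
```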
arXiv Detail & Related papers (2023-12-19T21:18:14Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z)
- Knowledge Rumination for Pre-trained Language Models [77.55888291165462]
We propose a new paradigm dubbed Knowledge Rumination to help the pre-trained language model utilize related latent knowledge without retrieving it from an external corpus.
We apply the proposed knowledge rumination to various language models, including RoBERTa, DeBERTa, and GPT-3.
arXiv Detail & Related papers (2023-05-15T15:47:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.