MeTMaP: Metamorphic Testing for Detecting False Vector Matching Problems
in LLM Augmented Generation
- URL: http://arxiv.org/abs/2402.14480v1
- Date: Thu, 22 Feb 2024 12:13:35 GMT
- Title: MeTMaP: Metamorphic Testing for Detecting False Vector Matching Problems
in LLM Augmented Generation
- Authors: Guanyu Wang, Yuekang Li, Yi Liu, Gelei Deng, Tianlin Li, Guosheng Xu,
Yang Liu, Haoyu Wang, Kailong Wang
- Abstract summary: This paper presents MeTMaP, a framework developed to identify false vector matching in LLM-augmented generation systems.
MeTMaP is based on the idea that semantically similar texts should match and dissimilar ones should not.
Our evaluation of MeTMaP over 203 vector matching configurations, involving 29 embedding models and 7 distance metrics, uncovers significant inaccuracies.
- Score: 15.382745718541063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Augmented generation techniques such as Retrieval-Augmented Generation (RAG)
and Cache-Augmented Generation (CAG) have revolutionized the field by enhancing
large language model (LLM) outputs with external knowledge and cached
information. However, the integration of vector databases, which serve as a
backbone for these augmentations, introduces critical challenges, particularly
in ensuring accurate vector matching. False vector matching in these databases
can significantly compromise the integrity and reliability of LLM outputs,
leading to misinformation or erroneous responses. Despite the crucial impact of
these issues, there is a notable research gap in methods to effectively detect
and address false vector matches in LLM-augmented generation. This paper
presents MeTMaP, a metamorphic testing framework developed to identify false
vector matching in LLM-augmented generation systems. We derive eight
metamorphic relations (MRs) from six NLP datasets, which form our method's
core, based on the idea that semantically similar texts should match and
dissimilar ones should not. MeTMaP uses these MRs to create sentence triplets
for testing, simulating real-world LLM scenarios. Our evaluation of MeTMaP over
203 vector matching configurations, involving 29 embedding models and 7
distance metrics, uncovers significant inaccuracies. The results, showing a
maximum accuracy of only 41.51% on our tests compared to the original
datasets, emphasize the widespread issue of false matches in vector matching
methods and the critical need for effective detection and mitigation in
LLM-augmented applications.
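The triplet idea in the abstract can be sketched as follows. This is only an illustration, not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, and the names `triplet_passes`, `accuracy`, and the example sentences are hypothetical. A triplet passes when the base sentence is closer to its semantically similar variant than to its dissimilar one.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def triplet_passes(base, positive, negative,
                   embed_fn=embed, dist_fn=cosine_distance):
    """True if the similar text is strictly closer to the base than the dissimilar one."""
    e = embed_fn(base)
    return dist_fn(e, embed_fn(positive)) < dist_fn(e, embed_fn(negative))


def accuracy(triplets, embed_fn=embed, dist_fn=cosine_distance):
    """Fraction of triplets for which the matching configuration behaves correctly."""
    passed = sum(triplet_passes(b, p, n, embed_fn, dist_fn) for b, p, n in triplets)
    return passed / len(triplets)


# Hypothetical (base, similar, dissimilar) triplets.
triplets = [
    # A negation keeps most of the words, so a lexical embedding is fooled.
    ("the cat sat on the mat",
     "a cat was sitting on the mat",
     "the cat did not sit on the mat"),
    # An unrelated sentence shares no words, so this one is easy.
    ("the dog barked loudly",
     "the dog barked very loudly",
     "stock prices fell sharply today"),
]

for t in triplets:
    print(triplet_passes(*t), "-", t[0])
print("accuracy:", accuracy(triplets))
```

Passing a different `embed_fn` or `dist_fn` corresponds to testing another vector matching configuration; the 203 configurations in the abstract are the 29 embedding models crossed with the 7 distance metrics.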
Related papers
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z)
- Improving Logits-based Detector without Logits from Black-box LLMs [56.234109491884126]
Large Language Models (LLMs) have revolutionized text generation, producing outputs that closely mimic human writing.
We present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection.
DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations.
arXiv Detail & Related papers (2024-06-07T19:38:05Z)
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z)
- Benchmarking Causal Study to Interpret Large Language Models for Source Code [6.301373791541809]
This paper introduces a benchmarking strategy named Galeras comprised of curated testbeds for three SE tasks.
We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
arXiv Detail & Related papers (2023-08-23T20:32:12Z)
- Efficient Detection of LLM-generated Texts with a Bayesian Surrogate Model [14.98695074168234]
We propose a new method to detect machine-generated text, especially from large language models (LLMs).
We use a Bayesian surrogate model, which allows us to select typical samples based on Bayesian uncertainty and interpolate scores from typical samples to other samples, to improve query efficiency.
Empirical results demonstrate that our method significantly outperforms existing approaches under a low query budget.
arXiv Detail & Related papers (2023-05-26T04:23:10Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Mixture of Soft Prompts for Controllable Data Generation [21.84489422361048]
Mixture of Soft Prompts (MSP) is proposed as a tool for data augmentation rather than direct prediction.
Our method achieves state-of-the-art results on three benchmarks when compared against strong baselines.
arXiv Detail & Related papers (2023-03-02T21:13:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.