MeTMaP: Metamorphic Testing for Detecting False Vector Matching Problems in LLM Augmented Generation
- URL: http://arxiv.org/abs/2402.14480v1
- Date: Thu, 22 Feb 2024 12:13:35 GMT
- Title: MeTMaP: Metamorphic Testing for Detecting False Vector Matching Problems in LLM Augmented Generation
- Authors: Guanyu Wang, Yuekang Li, Yi Liu, Gelei Deng, Tianlin Li, Guosheng Xu,
Yang Liu, Haoyu Wang, Kailong Wang
- Abstract summary: This paper presents MeTMaP, a framework developed to identify false vector matching in LLM-augmented generation systems.
MeTMaP is based on the idea that semantically similar texts should match and dissimilar ones should not.
Our evaluation of MeTMaP over 203 vector matching configurations, involving 29 embedding models and 7 distance metrics, uncovers significant inaccuracies.
- Score: 15.382745718541063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Augmented generation techniques such as Retrieval-Augmented Generation (RAG)
and Cache-Augmented Generation (CAG) have revolutionized the field by enhancing
large language model (LLM) outputs with external knowledge and cached
information. However, the integration of vector databases, which serve as a
backbone for these augmentations, introduces critical challenges, particularly
in ensuring accurate vector matching. False vector matching in these databases
can significantly compromise the integrity and reliability of LLM outputs,
leading to misinformation or erroneous responses. Despite the crucial impact of
these issues, there is a notable research gap in methods to effectively detect
and address false vector matches in LLM-augmented generation. This paper
presents MeTMaP, a metamorphic testing framework developed to identify false
vector matching in LLM-augmented generation systems. We derive eight
metamorphic relations (MRs) from six NLP datasets, which form our method's
core, based on the idea that semantically similar texts should match and
dissimilar ones should not. MeTMaP uses these MRs to create sentence triplets
for testing, simulating real-world LLM scenarios. Our evaluation of MeTMaP over
203 vector matching configurations, involving 29 embedding models and 7
distance metrics, uncovers significant inaccuracies. The results, showing a
maximum accuracy of only 41.51% on our tests compared to the original
datasets, emphasize the widespread issue of false matches in vector matching
methods and the critical need for effective detection and mitigation in
LLM-augmented applications.
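To make the core idea concrete, here is a minimal sketch of the kind of metamorphic check MeTMaP performs over sentence triplets. The embedding model, distance metric, and triplet below are illustrative placeholders, not the paper's actual configurations (which span 29 embedding models and 7 distance metrics):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")  # one of many possible models

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance, one of the metrics a vector database might use."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mr_holds(anchor: str, similar: str, dissimilar: str) -> bool:
    """Metamorphic relation: the semantically similar sentence must sit
    closer to the anchor than the dissimilar one; a violation signals
    a potential false vector match."""
    e_anchor, e_sim, e_dis = model.encode([anchor, similar, dissimilar])
    return cosine_distance(e_anchor, e_sim) < cosine_distance(e_anchor, e_dis)

# Illustrative triplet in the spirit of MeTMaP's MR-generated tests
print(mr_holds(
    "The medication should be taken twice a day.",
    "Take the medicine two times daily.",
    "The medication must never be taken twice a day.",
))
```

A matching configuration that frequently violates relations like this one would be flagged as prone to false vector matching.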
Related papers
- Invar-RAG: Invariant LLM-aligned Retrieval for Better Generation [43.630437906898635]
We propose a novel two-stage fine-tuning architecture called Invar-RAG.
In the retrieval stage, an LLM-based retriever is constructed by integrating LoRA-based representation learning.
In the generation stage, a refined fine-tuning method is employed to improve LLM accuracy in generating answers based on retrieved information.
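As a rough sketch of what LoRA-based representation learning for a retriever can look like (the base encoder and hyperparameters here are placeholders, not Invar-RAG's actual setup):

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Placeholder encoder standing in for the LLM-based retriever
base = AutoModel.from_pretrained("bert-base-uncased")
config = LoraConfig(
    r=8,                                # low-rank dimension
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # attention projections to adapt
    lora_dropout=0.05,
)
retriever = get_peft_model(base, config)
retriever.print_trainable_parameters()  # only the small adapter weights train
```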
arXiv Detail & Related papers (2024-11-11T14:25:37Z) - Robust Detection of LLM-Generated Text: A Comparative Analysis [0.276240219662896]
Large language models can be widely integrated into many aspects of life, and their output can quickly fill all network resources.
It becomes increasingly important to develop powerful detectors for the generated text.
This detector is essential to prevent the potential misuse of these technologies and to protect areas such as social media from the negative effects.
arXiv Detail & Related papers (2024-11-09T18:27:15Z) - Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG)
We compute a factuality score that can be thresholded to yield a binary decision.
Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
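The score-and-threshold step can be sketched as follows (the scores and labels below are made up for illustration; Provenance's actual scoring model is not shown):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical factuality scores (higher = more grounded) with gold labels
scores = np.array([0.91, 0.15, 0.78, 0.30, 0.88, 0.05])
labels = np.array([1, 0, 1, 0, 1, 0])  # 1 = factual, 0 = nonfactual

print("AUC:", roc_auc_score(labels, scores))  # the kind of metric reported

threshold = 0.5  # illustrative; tuned on held-out data in practice
is_factual = scores >= threshold  # thresholded binary decision
print(is_factual)
```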
arXiv Detail & Related papers (2024-11-01T20:44:59Z) - Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing [31.165102332393964]
Large language model-powered Automated Program Repair (LAPR) techniques have achieved state-of-the-art bug-fixing performance.
It is crucial to conduct robustness testing on LAPR techniques before their practical deployment.
We propose MT-LAPR, a Metamorphic Testing framework exclusively for LAPR techniques.
arXiv Detail & Related papers (2024-10-10T01:14:58Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
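A toy version of the perturb-the-reasoning-graph step might look like this (the graph, node names, and perturbation rule are invented for illustration; DARG's actual perturbations are more principled):

```python
import random
import networkx as nx

# Toy reasoning graph: nodes are reasoning steps, edges are dependencies
graph = nx.DiGraph([("premise_a", "step_1"), ("premise_b", "step_1"), ("step_1", "answer")])

def add_intermediate_step(g: nx.DiGraph, name: str) -> nx.DiGraph:
    """Split a random edge with an extra step, raising reasoning depth
    while keeping the graph's overall shape."""
    h = g.copy()
    u, v = random.choice(list(h.edges))
    h.remove_edge(u, v)
    h.add_edges_from([(u, name), (name, v)])
    return h

harder = add_intermediate_step(graph, "step_1b")
print(sorted(harder.edges))
```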
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z) - DALD: Improving Logits-based Detector without Logits from Black-box LLMs [56.234109491884126]
Large Language Models (LLMs) have revolutionized text generation, producing outputs that closely mimic human writing.
We present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection.
DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations.
arXiv Detail & Related papers (2024-06-07T19:38:05Z) - MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z) - Benchmarking Causal Study to Interpret Large Language Models for Source
Code [6.301373791541809]
This paper introduces a benchmarking strategy named Galeras comprised of curated testbeds for three SE tasks.
We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
arXiv Detail & Related papers (2023-08-23T20:32:12Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)