QuaCer-C: Quantitative Certification of Knowledge Comprehension in LLMs
- URL: http://arxiv.org/abs/2402.15929v1
- Date: Sat, 24 Feb 2024 23:16:57 GMT
- Title: QuaCer-C: Quantitative Certification of Knowledge Comprehension in LLMs
- Authors: Isha Chaudhary, Vedaant V. Jain, Gagandeep Singh
- Abstract summary: Large Language Models (LLMs) have demonstrated impressive performance on several benchmarks.
We propose a novel certification framework for LLMs, QuaCer-C, wherein we formally certify the knowledge-comprehension capabilities of popular LLMs.
- Score: 3.9648540377345367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated impressive performance on
several benchmarks. However, traditional studies do not provide formal
guarantees on the performance of LLMs. In this work, we propose a novel
certification framework for LLMs, QuaCer-C, wherein we formally certify the
knowledge-comprehension capabilities of popular LLMs. Our certificates are
quantitative - they consist of high-confidence, tight bounds on the probability
that the target LLM gives the correct answer on any relevant knowledge
comprehension prompt. Our certificates for the Llama, Vicuna, and Mistral LLMs
indicate that the knowledge comprehension capability improves with an increase
in the number of parameters and that the Mistral model is less performant than
the rest in this evaluation.
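The abstract does not spell out how such high-confidence bounds are computed; one standard construction for bounding a success probability from samples is the Clopper-Pearson binomial confidence interval, sketched below. This is illustrative only: it assumes certification reduces to sampling n knowledge-comprehension prompts and counting the k correct answers, and the function name and use of scipy are our assumptions, not the paper's code.

```python
# Illustrative sketch (not the paper's implementation): a two-sided
# Clopper-Pearson confidence interval for the probability that an LLM
# answers a sampled knowledge-comprehension prompt correctly. Assumes
# prompt sampling reduces certification to estimating a Bernoulli parameter.
from scipy.stats import beta

def certificate_bounds(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Bounds on the success probability holding with confidence 1 - alpha,
    given k correct answers out of n independently sampled prompts."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# Example: 860 correct answers out of 1000 prompts certifies a success
# probability of roughly 0.84 to 0.88 with 95% confidence.
print(certificate_bounds(860, 1000))
```

Tightness here comes from the sample size: as n grows, the interval shrinks around the empirical accuracy k/n while the confidence level stays fixed.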
Related papers
- Large Language Models as Reliable Knowledge Bases? [60.25969380388974]
Large Language Models (LLMs) can be viewed as potential knowledge bases (KBs).
This study defines criteria that a reliable LLM-as-KB should meet, focusing on factuality and consistency.
Strategies such as in-context learning (ICL) and fine-tuning are unsuccessful at making LLMs better KBs.
arXiv Detail & Related papers (2024-07-18T15:20:18Z)
- PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations [22.011216436252845]
We present PertEval, a toolkit for in-depth probing of large language models' knowledge capacity.
PertEval employs human-like restatement techniques to generate on-the-fly test samples from static benchmarks.
We show that PertEval can act as an essential tool that, when applied alongside any close-ended benchmark, unveils the true knowledge capacity of LLMs.
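As a concrete, hypothetical illustration of a knowledge-invariant perturbation, the toy sketch below reshuffles the answer options of a multiple-choice item: the surface form changes while the knowledge needed to answer does not. PertEval's actual restatement techniques are richer; nothing here is taken from its implementation.

```python
# Toy knowledge-invariant perturbation (illustrative; not PertEval's code):
# permute the answer options of a multiple-choice item and relabel the key.
# A model that truly knows the answer should be unaffected by the shuffle.
import random

def shuffle_options(options: list[str], answer_idx: int, seed: int = 0) -> tuple[list[str], int]:
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # Track where the original correct option landed after shuffling.
    return shuffled, order.index(answer_idx)

# Example: the correct option keeps its text but moves to a new position.
print(shuffle_options(["Paris", "Rome", "Oslo", "Lima"], answer_idx=0))
```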
arXiv Detail & Related papers (2024-05-30T06:38:32Z)
- Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
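One widely used way to attach uncertainty to multiple-choice answers is split conformal prediction, sketched below under the assumption that option-level probabilities are available; larger prediction sets signal higher uncertainty at the same coverage level. This is a generic illustration, not necessarily this benchmark's exact procedure.

```python
# Generic split-conformal sketch for multiple-choice answers (illustrative;
# not necessarily this benchmark's exact procedure). Larger prediction sets
# indicate higher model uncertainty at the same coverage guarantee.
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """cal_probs: (n, k) option probabilities on calibration questions;
    cal_labels: (n,) indices of the correct options;
    test_probs: (m, k) option probabilities on test questions.
    Returns an (m, k) boolean mask of options kept in each prediction set."""
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability given to the true option.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample correction, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return (1.0 - test_probs) <= q
```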
arXiv Detail & Related papers (2024-01-23T14:29:17Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- LLMRec: Benchmarking Large Language Models on Recommendation Task [54.48899723591296]
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated.
We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization.
The benchmark results indicate that LLMs display only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
arXiv Detail & Related papers (2023-08-23T16:32:54Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications encountered in real-world scenarios.
We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, and new tasks can easily be added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
- Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation [91.30946119104111]
We show that large language models (LLMs) exhibit unwavering confidence in their ability to answer questions.
Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries.
We also find that LLMs have a propensity to rely on the provided retrieval results when formulating answers; a toy prompt-assembly sketch follows this entry.
arXiv Detail & Related papers (2023-07-20T16:46:10Z)
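For intuition about the retrieval-augmentation setup this last entry studies, the minimal sketch below prepends retrieved passages to the question and invites abstention when the evidence is missing. The template and function name are our assumptions, not the paper's prompt.

```python
# Minimal retrieval-augmented prompt assembly (our assumed template, not the
# paper's): retrieved passages are prepended so the model can ground its
# answer, or abstain when the evidence does not cover the question.
def build_rag_prompt(question: str, passages: list[str]) -> str:
    # Number the passages so the model can refer to them.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the passages below. If they do not "
        "contain the answer, reply \"I don't know\".\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```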