Context Variance Evaluation of Pretrained Language Models for
Prompt-based Biomedical Knowledge Probing
- URL: http://arxiv.org/abs/2211.10265v1
- Date: Fri, 18 Nov 2022 14:44:09 GMT
- Title: Context Variance Evaluation of Pretrained Language Models for
Prompt-based Biomedical Knowledge Probing
- Authors: Zonghai Yao, Yi Cao, Zhichao Yang, Hong Yu
- Abstract summary: We show that prompt-based probing methods can only probe a lower bound of knowledge.
We introduce context variance into the prompt generation and propose a new rank-change-based evaluation metric.
- Score: 9.138354194112395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models (PLMs) have motivated research on what kinds of
knowledge these models learn. Fill-in-the-blank problems (e.g., cloze tests) are
a natural approach to gauging such knowledge. BioLAMA generates prompts for
a natural approach for gauging such knowledge. BioLAMA generates prompts for
biomedical factual knowledge triples and uses the Top-k accuracy metric to
evaluate different PLMs' knowledge. However, existing research has shown that
such prompt-based knowledge probing methods can only probe a lower bound of
knowledge. Many factors like prompt-based probing biases make the LAMA
benchmark unreliable and unstable. This problem is more prominent in BioLAMA.
The severely long-tailed vocabulary distribution and large-N-M relations keep
the performance gap between LAMA and BioLAMA notable. To address these issues,
we introduce context variance into the prompt generation and propose a new
rank-change-based evaluation metric. Unlike previous known-unknown evaluation
criteria, we introduce the concept of "Misunderstand" into LAMA for the first
time. Through experiments on 12 PLMs, our context variance prompts and
Understand-Confuse-Misunderstand (UCM) metric make BioLAMA friendlier to
large-N-M and rare relations. We also conduct a set of control experiments
to disentangle "understand" from mere "read and copy".
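To make the setup above concrete, here is a minimal sketch of cloze-style probing with a masked PLM: it ranks the gold object at the [MASK] position (the basis of Top-k accuracy) and then labels a triple by how that rank holds up across a context-variance paraphrase. The model checkpoint, prompt templates, example triple (aspirin, may_treat, headache), single-token gold object, and the three-way Understand/Confuse/Misunderstand thresholds are all illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of cloze-style knowledge probing with a masked LM, plus an
# illustrative rank-change labeling inspired by the UCM idea. All
# templates, thresholds, and the example triple are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-uncased"  # stand-in; a biomedical PLM would be probed in practice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

def object_rank(prompt: str, gold: str) -> int:
    """Rank (1 = best) of the gold object at the [MASK] position.
    Assumes the gold object is a single vocabulary token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]  # scores over the vocabulary
    gold_id = tokenizer.convert_tokens_to_ids(gold)
    ranking = logits.argsort(descending=True)
    return int((ranking == gold_id).nonzero(as_tuple=True)[0].item()) + 1

# One base prompt and one context-variance paraphrase for the same
# hypothetical triple: (aspirin, may_treat, headache).
base = "Aspirin is a drug used to treat [MASK]."
variant = "A patient suffering from [MASK] may be prescribed aspirin."

K = 10  # Top-k cutoff, as in BioLAMA's Top-k accuracy
r_base, r_var = object_rank(base, "headache"), object_rank(variant, "headache")

# Illustrative three-way labeling: a fact ranked highly under every
# context reads as "Understand"; one whose rank collapses when the
# context changes as "Confuse"; one never ranked highly as "Misunderstand".
if r_base <= K and r_var <= K:
    label = "Understand"
elif r_base <= K or r_var <= K:
    label = "Confuse"
else:
    label = "Misunderstand"
print(f"ranks: base={r_base}, variant={r_var} -> {label}")
```

Top-k accuracy over a dataset is then simply the fraction of triples whose gold object ranks within the top k on the base prompt; the context-variance evaluation instead looks at how labels shift as the prompt context varies.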
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
- Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models [55.332004960574004]
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established.
This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt.
We propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty.
arXiv Detail & Related papers (2024-07-20T11:19:58Z)
- Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications (a minimal sampling-based UE sketch follows this list).
arXiv Detail & Related papers (2024-07-11T16:51:33Z)
- What Matters in Memorizing and Recalling Facts? Multifaceted Benchmarks for Knowledge Probing in Language Models [15.057992220389604]
Language models often struggle with handling factual knowledge, exhibiting factual hallucination issues.
We introduce a knowledge probing benchmark, BELIEF(ICL), to evaluate the knowledge recall ability of both encoder- and decoder-based pre-trained language models.
We semi-automatically create MyriadLAMA, which has massively diverse prompts.
arXiv Detail & Related papers (2024-06-18T05:11:35Z)
- Towards Reliable Latent Knowledge Estimation in LLMs: Zero-Prompt Many-Shot Based Factual Knowledge Extraction [15.534647327246239]
We propose to eliminate prompt engineering when probing large language models (LLMs) for factual knowledge.
Our approach, called Zero-Prompt Latent Knowledge Estimator (ZP-LKE), leverages the in-context learning ability of LLMs.
We perform a large-scale evaluation of the factual knowledge of a variety of open-source LLMs over a large set of relations and facts from the Wikidata knowledge base.
arXiv Detail & Related papers (2024-04-19T15:40:39Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation [0.0]
This work explores the potential of Large Language Models for dialoguing with biomedical background knowledge.
The framework involves three evaluation steps, each sequentially assessing a different aspect: fluency, prompt alignment, semantic coherence, factual knowledge, and specificity of the generated responses.
The work provides a systematic assessment of the ability of eleven state-of-the-art LLMs, including ChatGPT, GPT-4 and Llama 2, in two prompting-based tasks.
arXiv Detail & Related papers (2023-05-28T22:46:21Z)
- Injecting Knowledge into Biomedical Pre-trained Models via Polymorphism and Synonymous Substitution [22.471123408160658]
Pre-trained language models (PLMs) are considered to be able to store relational knowledge present in the training data.
Low-frequency relational knowledge might be underexpressed compared to high-frequency knowledge in PLMs.
We propose a simple-yet-effective approach to inject relational knowledge into PLMs.
arXiv Detail & Related papers (2023-05-24T10:48:53Z)
- Knowledge Rumination for Pre-trained Language Models [77.55888291165462]
We propose a new paradigm dubbed Knowledge Rumination to help the pre-trained language model utilize related latent knowledge without retrieving it from the external corpus.
We apply the proposed knowledge rumination to various language models, including RoBERTa, DeBERTa, and GPT-3.
arXiv Detail & Related papers (2023-05-15T15:47:09Z)
- The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources in Natural Language Understanding Systems [87.3207729953778]
We evaluate state-of-the-art coreference resolution models on our dataset.
Several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time.
Still, even the best performing models seem to have difficulties with reliably integrating knowledge presented only at inference time.
arXiv Detail & Related papers (2022-12-15T23:26:54Z)
- Rewire-then-Probe: A Contrastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models [16.535312449449165]
We release a well-curated biomedical knowledge probing benchmark, MedLAMA, based on the Unified Medical Language System (UMLS) Metathesaurus.
We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most 3% acc@10.
We propose Contrastive-Probe, a novel self-supervised contrastive probing approach, that adjusts the underlying PLMs without using any probing data.
arXiv Detail & Related papers (2021-10-15T16:00:11Z)
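Related to the uncertainty-estimation entry above ("Uncertainty Estimation of Large Language Models in Medical Question Answering"), here is a minimal sketch of one common UE family, sampling-based answer consistency. `generate_answer` is a hypothetical wrapper around any LLM sampling call, and the string normalization and entropy estimate are illustrative choices, not a specific method from that paper.

```python
# Sampling-based uncertainty estimation for QA: sample several answers
# and score the question by their disagreement (empirical entropy).
# `generate_answer` is a hypothetical stand-in for an LLM sampling call.
import math
from collections import Counter

def answer_entropy(generate_answer, question: str, n_samples: int = 10) -> float:
    """Higher entropy = sampled answers disagree more = less reliable."""
    answers = [generate_answer(question).strip().lower() for _ in range(n_samples)]
    probs = [count / n_samples for count in Counter(answers).values()]
    return -sum(p * math.log(p) for p in probs)

# Usage (hypothetical): flag high-entropy questions for human review.
# if answer_entropy(my_llm, "Which enzyme does aspirin inhibit?") > 1.0:
#     route_to_expert()
```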
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.