Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing
- URL: http://arxiv.org/abs/2211.10265v1
- Date: Fri, 18 Nov 2022 14:44:09 GMT
- Title: Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing
- Authors: Zonghai Yao, Yi Cao, Zhichao Yang, Hong Yu
- Abstract summary: We show that prompt-based probing methods can only probe a lower bound of knowledge.
We introduce context variance into the prompt generation and propose a new rank-change-based evaluation metric.
- Score: 9.138354194112395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models (PLMs) have motivated research into what kinds of knowledge these models learn. Fill-in-the-blank problems (e.g., cloze tests) are a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs' knowledge. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors, such as prompt-based probing biases, make the LAMA benchmark unreliable and unstable, and this problem is more prominent in BioLAMA: the severely long-tailed vocabulary distribution and large-N-M relations keep the performance gap between LAMA and BioLAMA notable. To address these issues, we introduce context variance into prompt generation and propose a new rank-change-based evaluation metric. Departing from the previous known-unknown evaluation criteria, we propose the concept of "Misunderstand" in LAMA for the first time. Through experiments on 12 PLMs, we show that our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric make BioLAMA friendlier to large-N-M relations and rare relations. We also conduct a set of control experiments to disentangle "understand" from mere "read and copy".
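To make the setup concrete, here is a minimal sketch of cloze-style probing with a Top-k check, in the spirit of BioLAMA; the model name and the example triple are placeholder assumptions, and `gold_rank` is a hypothetical helper showing the quantity a rank-change-based metric such as UCM would compare across context-varied prompts.

```python
# A minimal sketch of cloze-style knowledge probing with a Top-k check,
# in the spirit of BioLAMA. The model name and the example triple are
# illustrative assumptions, not the paper's exact setup.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def gold_rank(prompt: str, gold: str, k: int = 100) -> int | None:
    """Rank of the gold object among the model's top-k fillers (None if absent)."""
    for rank, pred in enumerate(unmasker(prompt, top_k=k), start=1):
        if pred["token_str"].strip().lower() == gold.lower():
            return rank
    return None

# Top-k accuracy asks whether the gold answer appears among the first k
# fillers; a rank-change metric would instead compare this rank across
# context-varied paraphrases of the same prompt.
rank = gold_rank("Ibuprofen is used to treat [MASK].", "pain")
print("hit@10:", rank is not None and rank <= 10)
```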
Related papers
- Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models [55.332004960574004]
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established.
This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt.
We propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty.
arXiv Detail & Related papers (2024-07-20T11:19:58Z)
- Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
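As one illustrative baseline (our assumption for illustration, not necessarily one of the UE methods the paper benchmarks), answer agreement across repeated samples can serve as an uncertainty score:

```python
# An illustrative sampling-based uncertainty estimate: the normalized
# entropy of answers sampled repeatedly from an LLM. The `samples` list
# stands in for multiple stochastic generations.
import math
from collections import Counter

def answer_entropy(answers: list[str]) -> float:
    """Normalized entropy over sampled answers (0 = certain, 1 = maximally uncertain)."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(answers)) if len(answers) > 1 else 0.0

# Ten hypothetical samples for a medical question; high agreement -> low entropy.
samples = ["metformin"] * 7 + ["insulin"] * 2 + ["glipizide"]
print(f"uncertainty: {answer_entropy(samples):.2f}")
```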
arXiv Detail & Related papers (2024-07-11T16:51:33Z)
- What Matters in Memorizing and Recalling Facts? Multifaceted Benchmarks for Knowledge Probing in Language Models [15.057992220389604]
Language models often struggle with handling factual knowledge, exhibiting factual hallucination issues.
We introduce a knowledge probing benchmark, BELIEF(ICL), to evaluate the knowledge recall ability of both encoder- and decoder-based pre-trained language models.
We semi-automatically create MyriadLAMA, which has massively diverse prompts.
arXiv Detail & Related papers (2024-06-18T05:11:35Z)
- Towards Reliable Latent Knowledge Estimation in LLMs: In-Context Learning vs. Prompting Based Factual Knowledge Extraction [15.534647327246239]
We propose an approach for estimating the latent knowledge embedded inside large language models (LLMs).
We leverage the in-context learning abilities of LLMs to estimate the extent to which an LLM knows the facts stored in a knowledge base.
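A rough sketch of the idea, with a hypothetical prompt format (the paper defines its own estimation procedure):

```python
# ICL-based knowledge estimation, sketched: show the model a few facts from
# a knowledge base, then ask it to complete a held-out one. The template
# and the capital-city relation are assumptions for illustration.
def icl_knowledge_prompt(demo_facts: list[tuple[str, str]], query_subject: str,
                         template: str = "The capital of {} is {}.") -> str:
    lines = [template.format(s, o) for s, o in demo_facts]
    lines.append(template.format(query_subject, "").rstrip(". "))
    return "\n".join(lines)

prompt = icl_knowledge_prompt([("France", "Paris"), ("Italy", "Rome")], "Spain")
print(prompt)
# Comparing the LLM's completion against the knowledge base, averaged over
# many held-out facts, yields one estimate of how much of the KB it "knows".
```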
arXiv Detail & Related papers (2024-04-19T15:40:39Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
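A rough sketch of how such an evidence-augmented few-shot prompt might be assembled (field names and formatting are our assumptions, not the paper's exact template):

```python
# Assembling a search-augmented few-shot prompt, loosely in the spirit of
# FreshPrompt. The question, evidence, and demonstration below are
# hypothetical examples.
from datetime import date

def build_fresh_prompt(question: str, evidences: list[dict], demos: list[str]) -> str:
    lines = [f"Today is {date.today():%B %d, %Y}. Use the evidence below; "
             "prefer more recent sources."]
    lines += demos                            # few-shot demonstrations
    for i, ev in enumerate(evidences, 1):     # retrieved search results
        lines.append(f"[{i}] {ev['title']} ({ev['date']}): {ev['snippet']}")
    lines += [f"Question: {question}", "Answer:"]
    return "\n".join(lines)

prompt = build_fresh_prompt(
    "Who is the current CEO of Twitter?",
    [{"title": "Example News", "date": "2023-06-01",
      "snippet": "Linda Yaccarino was named CEO of Twitter."}],
    ["Question: What year is it?\nAnswer: 2023."],
)
print(prompt)
```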
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation [0.0]
This work explores the potential of Large Language Models for dialoguing with biomedical background knowledge.
The framework involves three evaluation steps, each assessing different aspects sequentially: fluency, prompt alignment, semantic coherence, factual knowledge, and specificity of the generated responses.
The work provides a systematic assessment of the ability of eleven state-of-the-art LLMs, including ChatGPT, GPT-4 and Llama 2, in two prompting-based tasks.
arXiv Detail & Related papers (2023-05-28T22:46:21Z)
- Injecting Knowledge into Biomedical Pre-trained Models via Polymorphism and Synonymous Substitution [22.471123408160658]
Pre-trained language models (PLMs) are considered capable of storing relational knowledge present in the training data.
Low-frequency relational knowledge might be underexpressed compared to high-frequency knowledge in PLMs.
We propose a simple-yet-effective approach to inject relational knowledge into PLMs.
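As a toy illustration of the synonymous-substitution idea (the lexicon and rewriting rule below are made-up assumptions, not the paper's resources):

```python
# Synonymous substitution as data augmentation, sketched: rewriting a
# training sentence with synonyms so low-frequency surface forms of the
# same relational fact appear more often. The lexicon is a toy example.
import random

SYNONYMS = {
    "myocardial infarction": ["heart attack", "MI"],
    "treats": ["is a therapy for", "is used against"],
}

def synonym_substitute(sentence: str) -> str:
    for term, alternatives in SYNONYMS.items():
        if term in sentence:
            sentence = sentence.replace(term, random.choice(alternatives))
    return sentence

print(synonym_substitute("Aspirin treats myocardial infarction."))
```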
arXiv Detail & Related papers (2023-05-24T10:48:53Z)
- Knowledge Rumination for Pre-trained Language Models [77.55888291165462]
We propose a new paradigm dubbed Knowledge Rumination to help the pre-trained language model utilize related latent knowledge without retrieving it from the external corpus.
We apply the proposed knowledge rumination to various language models, including RoBERTa, DeBERTa, and GPT-3.
arXiv Detail & Related papers (2023-05-15T15:47:09Z)
- The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources in Natural Language Understanding Systems [87.3207729953778]
We evaluate state-of-the-art coreference resolution models on our dataset.
Several models struggle to reason on-the-fly over knowledge observed both at pretraining time and at inference time.
Still, even the best-performing models seem to have difficulty reliably integrating knowledge presented only at inference time.
arXiv Detail & Related papers (2022-12-15T23:26:54Z)
- Rewire-then-Probe: A Contrastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models [16.535312449449165]
We release a well-curated biomedical knowledge probing benchmark, MedLAMA, based on the Unified Medical Language System (UMLS) Metathesaurus.
We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most 3% acc@10.
We propose Contrastive-Probe, a novel self-supervised contrastive probing approach that adjusts the underlying PLMs without using any probing data.
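For illustration only, here is a generic InfoNCE-style objective of the kind self-supervised contrastive methods typically build on; it is not the Contrastive-Probe recipe itself:

```python
# A generic InfoNCE-style contrastive objective, shown only to illustrate
# the family of losses contrastive probing methods build on; NOT the exact
# Contrastive-Probe recipe from the paper.
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """queries/keys: (batch, dim); keys[i] is the positive for queries[i]."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / temperature      # cosine similarities as logits
    labels = torch.arange(q.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```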
arXiv Detail & Related papers (2021-10-15T16:00:11Z)
- Exploring Bayesian Deep Learning for Urgent Instructor Intervention Need in MOOC Forums [58.221459787471254]
Massive Open Online Courses (MOOCs) have become a popular choice for e-learning thanks to their great flexibility.
Due to large numbers of learners and their diverse backgrounds, it is taxing to offer real-time support.
With the large volume of posts and high workloads for MOOC instructors, it is unlikely that the instructors can identify all learners requiring intervention.
This paper is the first to explore Bayesian deep learning on learner-based text posts, using two methods: Monte Carlo Dropout and Variational Inference.
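A minimal Monte Carlo Dropout sketch (the tiny classifier below is a placeholder, not the paper's model for forum posts): keep dropout active at inference and aggregate repeated stochastic passes.

```python
# Monte Carlo Dropout, sketched: dropout stays stochastic at inference and
# T forward passes are aggregated into a predictive mean and spread.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(300, 64), nn.ReLU(),
                      nn.Dropout(0.5), nn.Linear(64, 2))

def mc_dropout_predict(x: torch.Tensor, T: int = 30):
    model.train()                               # keep dropout active
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(-1) for _ in range(T)])
    return probs.mean(0), probs.std(0)          # predictive mean and spread

mean, spread = mc_dropout_predict(torch.randn(1, 300))
print(mean, spread)   # a high spread flags posts that may need intervention
```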
arXiv Detail & Related papers (2021-04-26T15:12:13Z)