What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on
Curiosity-Driven Questioning
- URL: http://arxiv.org/abs/2409.17172v1
- Date: Thu, 19 Sep 2024 22:12:16 GMT
- Title: What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on
Curiosity-Driven Questioning
- Authors: Shashidhar Reddy Javaji, Zining Zhu
- Abstract summary: Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown.
We propose a novel framework that evaluates this capability.
We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective.
- Score: 4.3512163406552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can store a massive amount of knowledge, yet
their potential to acquire new knowledge remains unknown. We propose a novel
framework that evaluates this capability. This framework prompts
LLMs to generate questions about a statement introducing scientific knowledge,
simulating a curious person when facing the statement for the first time. We
score the qualities of the generated questions, thereby evaluating the
knowledge acquisition potential of the LLM. We apply controlled ablation
studies to validate our scoring procedures. Additionally, we created a
synthetic dataset consisting of 1101 statements in physics, chemistry, and
maths with distinct levels of difficulty, 300 general knowledge statements,
and 567 incorrect statements. Human evaluations were conducted to validate our
model assessments, achieving an approximate weighted Cohen's kappa of 0.7 on
all three metrics considered. We find that while large models like GPT-4 and
Mistral 8x7b are adept at generating coherent and relevant questions, the
smaller Phi-2 model is equally or more effective. This indicates that size does
not solely determine a model's knowledge acquisition potential. The proposed
framework quantifies a critical model capability that was commonly overlooked
and opens up research opportunities for developing more knowledgeable AI
systems.
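As an illustration of the evaluation pipeline described in the abstract (prompt an LLM with a seed statement, collect its curiosity-driven questions, score their quality, and validate the automatic scores against human ratings with a weighted Cohen's kappa), a minimal sketch follows. The prompt wording, the generate_questions and score_question helpers, and the three quality dimensions shown are illustrative assumptions, not the authors' implementation; only the kappa computation uses a standard scikit-learn call.

```python
# Minimal sketch of a curiosity-driven questioning evaluation loop.
# The helper names, prompt text, and scoring rubric below are illustrative
# assumptions; they do not reproduce the paper's exact implementation.
from sklearn.metrics import cohen_kappa_score

PROMPT_TEMPLATE = (
    "You are seeing the following statement for the first time:\n"
    "{statement}\n"
    "Ask the questions a curious person would ask to understand it."
)


def generate_questions(statement: str, n: int = 3) -> list[str]:
    """Placeholder for a call to the LLM under evaluation (e.g. GPT-4, Phi-2)."""
    prompt = PROMPT_TEMPLATE.format(statement=statement)
    # In practice, send `prompt` to the model and parse the returned questions;
    # canned strings are returned here so the sketch runs without an API key.
    return [f"Illustrative question {i + 1} about: {statement}" for i in range(n)]


def score_question(question: str) -> dict[str, int]:
    """Placeholder judge returning 1-5 ratings on three assumed quality metrics."""
    return {"relevance": 4, "coherence": 5, "diversity": 3}


if __name__ == "__main__":
    statement = "In a right triangle, a^2 + b^2 = c^2."
    questions = generate_questions(statement)
    model_scores = [score_question(q)["relevance"] for q in questions]

    # Validation step: agreement between automatic and human ratings, measured
    # with a weighted Cohen's kappa (the paper reports roughly 0.7).
    human_scores = [4, 3, 4]  # illustrative human annotations
    kappa = cohen_kappa_score(model_scores, human_scores, weights="quadratic")
    print(f"Weighted Cohen's kappa: {kappa:.2f}")
```

In the framework itself the scoring step would be performed per statement across the full dataset and then checked against human annotators; the toy numbers above only show where that agreement check would sit.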
Related papers
- What Matters in Memorizing and Recalling Facts? Multifaceted Benchmarks for Knowledge Probing in Language Models [15.057992220389604]
Language models often struggle with handling factual knowledge, exhibiting factual hallucination issues.
We introduce a knowledge probing benchmark, BELIEF(ICL), to evaluate the knowledge recall ability of both encoder- and decoder-based pre-trained language models.
We semi-automatically create MyriadLAMA, which has massively diverse prompts.
arXiv Detail & Related papers (2024-06-18T05:11:35Z)
- R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges.
Previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning).
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
- Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators [78.63553017938911]
Large language models (LLMs) outperform information retrieval techniques for downstream knowledge-intensive tasks.
However, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge.
We introduce CONNER, designed to evaluate generated knowledge from six important perspectives.
arXiv Detail & Related papers (2023-10-11T08:22:37Z)
- Physics of Language Models: Part 3.2, Knowledge Manipulation [51.68385617116854]
This paper investigates four fundamental knowledge manipulation tasks.
We show that language models excel in knowledge retrieval but struggle even in the simplest classification or comparison tasks.
Our findings also apply to modern pretrained language models such as GPT-4.
arXiv Detail & Related papers (2023-09-25T17:50:41Z)
- KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA).
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)
- Do Large Language Models Know What They Don't Know? [74.65014158544011]
Large language models (LLMs) have a wealth of knowledge that allows them to excel in various Natural Language Processing (NLP) tasks.
Despite their vast knowledge, LLMs are still limited by the amount of information they can accommodate and comprehend.
This study aims to evaluate LLMs' self-knowledge by assessing their ability to identify unanswerable or unknowable questions.
arXiv Detail & Related papers (2023-05-29T15:30:13Z)
- DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering [34.70206857546496]
Question answering models commonly have access to two sources of "knowledge" at inference time: parametric knowledge encoded in the model's weights and non-parametric knowledge given in the input context.
It is often unclear whether an answer stems from the given non-parametric knowledge or not.
We propose a new paradigm in which QA models are trained to disentangle the two sources of knowledge.
arXiv Detail & Related papers (2022-11-10T15:34:44Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge [36.13528043657398]
We investigate whether state-of-the-art QA models have general knowledge about word definitions and general taxonomic reasoning.
We use a methodology for automatically building datasets from various types of expert knowledge.
Our evaluation confirms that transformer-based QA models are already predisposed to recognize certain types of structural lexical knowledge.
arXiv Detail & Related papers (2019-12-31T15:05:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.