The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
- URL: http://arxiv.org/abs/2502.08946v1
- Date: Thu, 13 Feb 2025 04:00:03 GMT
- Title: The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
- Authors: Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou
- Abstract summary: We propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo.
Our task alleviates the memorization issue via the usage of grid-format inputs that abstractly describe physical phenomena.
A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language.
- Score: 65.28200190598082
- License:
- Abstract: In a systematic way, we investigate a widely asked question: Do LLMs really understand what they say?, which relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the usage of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 Flash Thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format added little to their performance.
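To make the task format concrete, the abstract's grid-format probe can be sketched as follows. This is a hypothetical illustration only: the grid encoding, the `build_prompt` helper, the toy "gravity" pattern, and the answer options are all assumptions for exposition, not the paper's actual data or code.

```python
# Hypothetical sketch of a grid-format, multiple-choice probe in the spirit
# of PhysiCo; the exact serialization and question wording are assumptions.

def grid_to_text(grid):
    """Serialize a 2-D grid of color indices into line-per-row text an LLM would see."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_prompt(grid, choices):
    """Compose a multiple-choice question asking which concept the grid depicts."""
    options = "\n".join(f"({label}) {concept}" for label, concept in choices)
    return (
        "Which physical concept does this abstract grid pattern illustrate?\n\n"
        f"{grid_to_text(grid)}\n\n{options}\n"
        "Answer with the option letter."
    )

# A toy pattern meant to evoke gravity: a block of 2s has settled at the bottom.
grid = [
    [0, 0, 0],
    [0, 0, 0],
    [2, 2, 2],
]
choices = [("A", "gravity"), ("B", "diffusion"), ("C", "reflection"), ("D", "inertia")]
prompt = build_prompt(grid, choices)
print(prompt)
```

The point of the format, per the abstract, is that the answer cannot be pattern-matched from memorized text: the model must map an abstract visual-style pattern onto the underlying physical concept.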
Related papers
- Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? [48.41029452721923]
Vision Language Models (VLMs) are impressive in tasks such as visual question answering (VQA) and image captioning.
Their ability to apply multi-step reasoning to images has lagged, giving rise to perceptions of modality imbalance or brittleness.
arXiv Detail & Related papers (2025-01-05T21:36:38Z) - Representation in large language models [0.0]
I argue that Large Language Models behavior is partially driven by representation-based information processing.
I describe techniques for investigating these representations and developing explanations.
arXiv Detail & Related papers (2025-01-01T16:19:48Z) - Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmentation over KGs (Amar) framework.
This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings.
Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z) - Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective [5.769786334333616]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) based applications including automated text generation, question answering, and others.
They face a significant challenge: hallucinations, where models produce plausible-sounding but factually incorrect responses.
This paper discusses these open challenges covering state-of-the-art datasets and benchmarks as well as methods for knowledge integration and evaluating hallucinations.
arXiv Detail & Related papers (2024-11-21T16:09:05Z) - Meaningful Learning: Enhancing Abstract Reasoning in Large Language Models via Generic Fact Guidance [38.49506722997423]
Large language models (LLMs) have developed impressive performance and strong explainability across various reasoning scenarios.
LLMs often struggle to abstract and apply the generic fact to provide consistent and precise answers.
This has sparked a vigorous debate about whether LLMs are genuinely reasoning or merely memorizing.
arXiv Detail & Related papers (2024-03-14T04:06:13Z) - From Understanding to Utilization: A Survey on Explainability for Large Language Models [27.295767173801426]
This survey underscores the imperative for increased explainability in Large Language Models (LLMs).
Our focus is primarily on pre-trained Transformer-based LLMs, which pose distinctive interpretability challenges due to their scale and complexity.
When considering the utilization of explainability, we explore several compelling methods that concentrate on model editing, control generation, and model enhancement.
arXiv Detail & Related papers (2024-01-23T16:09:53Z) - Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z) - Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z) - POSQA: Probe the World Models of LLMs with Size Comparisons [38.30479784257936]
Embodied language comprehension emphasizes that language understanding is not solely a matter of mental processing in the brain.
With the explosive growth of Large Language Models (LLMs) and their already ubiquitous presence in our daily lives, it is becoming increasingly necessary to verify their real-world understanding.
arXiv Detail & Related papers (2023-10-20T10:05:01Z) - MAML and ANIL Provably Learn Representations [60.17417686153103]
We prove that two well-known meta-learning methods, MAML and ANIL, are capable of learning common representation among a set of given tasks.
Specifically, in the well-known multi-task linear representation learning setting, they are able to recover the ground-truth representation at an exponentially fast rate.
Our analysis illuminates that the driving force causing MAML and ANIL to recover the underlying representation is that they adapt the final layer of their model.
arXiv Detail & Related papers (2022-02-07T19:43:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.