Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models
- URL: http://arxiv.org/abs/2405.09605v2
- Date: Thu, 03 Jul 2025 20:10:24 GMT
- Title: Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models
- Authors: Anna A. Ivanova, Aalok Sathe, Benjamin Lipkin, Unnathi Kumar, Setayesh Radkani, Thomas H. Clark, Carina Kauf, Jennifer Hu, R. T. Pramod, Gabriel Grand, Vivian Paulun, Maria Ryskina, Ekin Akyürek, Ethan Wilcox, Nafisa Rashid, Leshem Choshen, Roger Levy, Evelina Fedorenko, Joshua Tenenbaum, Jacob Andreas
- Abstract summary: Elements of World Knowledge (EWoK) is a framework for evaluating language models' understanding of conceptual knowledge underlying world modeling. EWoK-core-1.0 is a dataset of 4,374 items covering 11 world knowledge domains. All tested models perform worse than humans, with results varying drastically across domains.
- Score: 51.891804790725686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to build and reason about models of the world is essential for situated language understanding. But evaluating world modeling capabilities in modern AI systems -- especially those based on language models -- has proven challenging, in large part because of the difficulty of disentangling conceptual knowledge about the world from knowledge of surface co-occurrence statistics. This paper presents Elements of World Knowledge (EWoK), a framework for evaluating language models' understanding of the conceptual knowledge underlying world modeling. EWoK targets specific concepts from multiple knowledge domains known to be important for world modeling in humans, from social interactions (help, deceive) to spatial relations (left, right). Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWoK-core-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B--70B parameters) and compare them with human performance. All tested models perform worse than humans, with results varying drastically across domains. Performance on social interactions and social properties was highest and performance on physical relations and spatial relations was lowest. Overall, this dataset highlights simple cases where even large models struggle and presents rich avenues for targeted research on LLM world modeling capabilities.
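As the abstract notes, EWoK items pair minimal-contrast contexts with targets whose plausibility depends on the concept being probed, so evaluation reduces to checking whether a model assigns higher probability to a target under the matching context than under the mismatched one. The sketch below illustrates that style of scoring with a small Hugging Face causal LM; the model name, the spatial-relations example sentences, and the scoring helper are illustrative assumptions rather than the paper's released evaluation code.

```python
# Minimal-pair scoring sketch in the spirit of EWoK (not the official harness).
# Assumptions: any causal LM from Hugging Face transformers works here; "gpt2"
# is a stand-in for the 1.3B-70B open-weights models evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def target_logprob(context: str, target: str) -> float:
    """Sum of log-probabilities of the target tokens, conditioned on the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + " " + target, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position i predicts token i+1
    total = 0.0
    for pos in range(ctx_len, full_ids.shape[1]):  # score only the target tokens
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

# Hypothetical spatial-relations item with agent slots filled in
# (not taken verbatim from EWoK-core-1.0).
context_match = "Ali is to the left of Beth."
context_mismatch = "Ali is to the right of Beth."
target = "Beth is to the right of Ali."

correct = target_logprob(context_match, target) > target_logprob(context_mismatch, target)
print("Model prefers the matching context:", correct)
```

Accuracy over a domain would then be the fraction of items where the model prefers the matching pairing, which is the kind of per-domain score behind the human comparison reported in the abstract.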
Related papers
- Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models [93.1043186636177]
We explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea, a "Model Synthesis Architecture" (MSA). We evaluate our MSA as a model of human judgments on a novel reasoning dataset.
arXiv Detail & Related papers (2025-07-16T18:01:03Z) - PoE-World: Compositional World Modeling with Products of Programmatic Experts [41.07916209987106]
Learning how the world works is central to building AI agents that can adapt to complex environments. Recent advances in program synthesis using Large Language Models (LLMs) offer an alternative approach that learns world models represented as source code. We show that this approach can learn complex world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge.
arXiv Detail & Related papers (2025-05-16T03:28:42Z) - Learning Local Causal World Models with State Space Models and Attention [1.5498250598583487]
We show that an SSM can model the dynamics of a simple environment and learn a causal model at the same time. We pave the way for further experiments that lean into the strengths of SSMs and further enhance them with causal awareness.
arXiv Detail & Related papers (2025-05-04T11:57:02Z) - AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability [84.52205243353761]
Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment.
We investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation.
arXiv Detail & Related papers (2025-04-06T20:35:44Z) - FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response [19.744969357182665]
We introduce a pipeline to create Field Ready Instruction Decoding Agent (FRIDA) models.
We fine-tune several LLaMa and Mistral instruction-tuned models and find that FRIDA models outperform their base models at a variety of sizes.
We conclude that the FRIDA pipeline is capable of instilling general common sense, but needs to be augmented with information retrieval for specific domain knowledge.
arXiv Detail & Related papers (2025-02-25T18:51:06Z) - Text2World: Benchmarking Large Language Models for Symbolic World Model Generation [45.03755994315517]
We introduce a novel benchmark, Text2World, based on the Planning Domain Definition Language (PDDL).
We find that reasoning models trained with large-scale reinforcement learning outperform others.
Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs.
arXiv Detail & Related papers (2025-02-18T17:59:48Z) - Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches [0.0]
Multimodal large language models (MLLMs) learn from vast and diverse data sources. Despite this impressive feat, the cognitive abilities of state-of-the-art LLMs trained on large-scale datasets are still superficial and brittle. We discuss how the principles of embodiment, symbol grounding, causality and memory can be leveraged toward the attainment of artificial general intelligence (AGI) in an organic manner.
arXiv Detail & Related papers (2025-01-06T17:18:47Z) - Making Large Language Models into World Models with Precondition and Effect Knowledge [1.8561812622368763]
We show that Large Language Models (LLMs) can be induced to perform two critical world model functions.
We validate that the precondition and effect knowledge generated by our models aligns with human understanding of world dynamics.
arXiv Detail & Related papers (2024-09-18T19:28:04Z) - Evaluating the World Model Implicit in a Generative Model [7.317896355747284]
Recent work suggests that large language models may implicitly learn world models.
This includes problems as diverse as simple logical reasoning, geographic navigation, game-playing, and chemistry.
We propose new evaluation metrics for world model recovery inspired by the classic Myhill-Nerode theorem from language theory.
arXiv Detail & Related papers (2024-06-06T02:20:31Z) - WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning [49.72868038180909]
We present WorldQA, a video dataset designed to push the boundaries of multimodal world models.
We identify five essential types of world knowledge for question formulation.
We introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain.
arXiv Detail & Related papers (2024-05-06T08:42:34Z) - Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection [9.788417605537965]
We introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement.
Our proposed method achieves state-of-the-art results in open vocabulary HOI detection.
arXiv Detail & Related papers (2024-04-09T10:27:22Z) - Assessment of Multimodal Large Language Models in Alignment with Human Values [43.023052912326314]
We introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations.
The Ch3Ef dataset contains 1,002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh (helpful, honest, harmless) principle.
arXiv Detail & Related papers (2024-03-26T16:10:21Z) - Open World Object Detection in the Era of Foundation Models [53.683963161370585]
We introduce a new benchmark that includes five real-world application-driven datasets.
We introduce a novel method, Foundation Object detection Model for the Open world, or FOMO, which identifies unknown objects based on their shared attributes with the base known objects.
arXiv Detail & Related papers (2023-12-10T03:56:06Z) - Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models [74.81091933317882]
We introduce EvolvingQA, a temporally evolving question-answering benchmark designed for training and evaluating LMs on an evolving Wikipedia database.
We uncover that existing continual learning baselines struggle to update and remove outdated knowledge.
Our work aims to model the dynamic nature of real-world information, suggesting faithful evaluations of the evolution-adaptability of language models.
arXiv Detail & Related papers (2023-11-14T12:12:02Z) - The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World [71.52132776748628]
We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.
We create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions.
We develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding.
arXiv Detail & Related papers (2023-08-03T17:59:47Z) - Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models learned to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z) - Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z) - Language Models Meet World Models: Embodied Experiences Enhance Language Models [48.70726641605047]
Large language models (LMs) often struggle with simple reasoning and planning in physical environments.
We propose a new paradigm of enhancing LMs by finetuning them with world models.
arXiv Detail & Related papers (2023-05-18T00:35:38Z) - CAZSL: Zero-Shot Regression for Pushing Models by Generalizing Through Context [13.217582954907234]
We study the problem of designing deep learning agents which can generalize their models of the physical world by building context-aware models.
We present context-aware zero-shot learning (CAZSL, pronounced as casual) models, an approach utilizing a Siamese network, an embedding space, and regularization based on context variables.
We test our proposed learning algorithm on the recently released Omnipush dataset, which allows testing of meta-learning capabilities.
arXiv Detail & Related papers (2020-03-26T01:21:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.