Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding
- URL: http://arxiv.org/abs/2501.17310v2
- Date: Thu, 30 Jan 2025 07:15:04 GMT
- Title: Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding
- Authors: Yun-Shiuan Chuang, Nikunj Harlalka, Sameer Narendran, Alexander Cheung, Sizhe Gao, Siddharth Suresh, Junjie Hu, Timothy T. Rogers,
- Abstract summary: We introduce a novel guesstimation dataset, MARBLES.
This dataset requires one to estimate how many items can fit into containers.
We propose WOC decoding'' strategy for LLM guesstimation.
- Score: 42.35821271298182
- License:
- Abstract: Guesstimation, the task of making approximate quantity estimates, is a common real-world challenge. However, it has been largely overlooked in large language models (LLMs) and vision language models (VLMs) research. We introduce a novel guesstimation dataset, MARBLES. This dataset requires one to estimate how many items (e.g., marbles) can fit into containers (e.g., a one-cup measuring cup), both with and without accompanying images. Inspired by the social science concept of the ``Wisdom of Crowds'' (WOC) - taking the median from estimates from a crowd), which has proven effective in guesstimation, we propose ``WOC decoding'' strategy for LLM guesstimation. We show that LLMs/VLMs perform well on guesstimation, suggesting that they possess some level of a "world model" necessary for guesstimation. Moreover, similar to human performance, the WOC decoding method improves LLM/VLM guesstimation accuracy. Furthermore, the inclusion of images in the multimodal condition enhances model performance. These results highlight the value of WOC decoding strategy for LLMs/VLMs and position guesstimation as a probe for evaluating LLMs/VLMs' world model. As LLMs' world model is a fundamental prerequisite for many real-world tasks, e.g., human-AI teaming, our findings have broad implications for the AI community.
Related papers
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z) - Agentic Society: Merging skeleton from real world and texture from Large Language Model [4.740886789811429]
This paper explores a novel framework that leverages census data and large language models to generate virtual populations.
We show that our method produces personas with variability essential for simulating diverse human behaviors in social science experiments.
But the evaluation result shows that only weak sign of statistical truthfulness can be produced due to limited capability of current LLMs.
arXiv Detail & Related papers (2024-09-02T08:28:19Z) - Bayesian Statistical Modeling with Predictors from LLMs [5.5711773076846365]
State of the art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks.
This raises questions about the human-likeness of LLM-derived information.
arXiv Detail & Related papers (2024-06-13T11:33:30Z) - Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs [30.179703001666173]
Factuality issue is a critical concern for Large Language Models (LLMs)
We propose GraphEval to evaluate an LLM's performance using a substantially large test dataset.
Test dataset is retrieved from a large knowledge graph with more than 10 million facts without expensive human efforts.
arXiv Detail & Related papers (2024-04-01T06:01:17Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Evaluating and Explaining Large Language Models for Code Using Syntactic
Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z) - LVLM-eHub: A Comprehensive Evaluation Benchmark for Large
Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building a LVLM evaluation Hub (LVLM-eHub)
Our LVLM-eHub consists of $8$ representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLM with massive in-domain data such as InstructBLIP heavily overfits many existing tasks, generalizing poorly in the open-world scenario.
arXiv Detail & Related papers (2023-06-15T16:39:24Z) - LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language
Models [13.659853119356507]
Large Language Models (LLMs) have revolutionized natural language processing and demonstrated impressive capabilities in various tasks.
They are prone to hallucinations, where the model exposes incorrect or false information in its responses.
We propose LLMMaps as a novel visualization technique that enables users to evaluate LLMs' performance with respect to Q&A datasets.
arXiv Detail & Related papers (2023-04-02T05:47:09Z) - Attributed Question Answering: Evaluation and Modeling for Attributed
Large Language Models [68.37431984231338]
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision.
We believe the ability of an LLM to an attribute to the text that it generates is likely to be crucial for both system developers and users in this setting.
arXiv Detail & Related papers (2022-12-15T18:45:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.