How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New
Reasoning Challenge for AI
- URL: http://arxiv.org/abs/2110.14207v1
- Date: Wed, 27 Oct 2021 06:39:33 GMT
- Title: How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New
Reasoning Challenge for AI
- Authors: Ashwin Kalyan, Abhinav Kumar, Arjun Chandrasekaran, Ashish Sabharwal,
Peter Clark
- Abstract summary: We propose a new reasoning challenge, namely Fermi Problems (FPs).
FPs are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible.
We present two datasets: 1) A collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge.
- Score: 32.54495599722743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many real-world problems require the combined application of multiple
reasoning abilities employing suitable abstractions, commonsense knowledge, and
creative synthesis of problem-solving strategies. To help advance AI systems
towards such capabilities, we propose a new reasoning challenge, namely Fermi
Problems (FPs), which are questions whose answers can only be approximately
estimated because their precise computation is either impractical or
impossible. For example, "How much would the sea level rise if all ice in the
world melted?" FPs are commonly used in quizzes and interviews to bring out and
evaluate the creative reasoning abilities of humans. To do the same for AI
systems, we present two datasets: 1) A collection of 1k real-world FPs sourced
from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate
complexity to serve as a sandbox for the harder real-world challenge. In
addition to question-answer pairs, the datasets contain detailed solutions in
the form of an executable program and supporting facts, enabling supervision
and evaluation of intermediate steps. We demonstrate that even extensively
fine-tuned large-scale language models perform poorly on these datasets, on
average making estimates that are off by two orders of magnitude. Our
contribution is thus the crystallization of several unsolved AI problems into a
single, new challenge that we hope will spur further advances in building
systems that can reason.
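To make the solution format concrete, below is a minimal sketch of what an executable Fermi-problem solution could look like, using the title question as an example. Every quantity in it (attendance, cups per day, conference length, cup size, the gold answer) is an illustrative assumption rather than a value from the dataset, and the log10-based error at the end is one common way to express the "off by two orders of magnitude" finding (an error of 2.0 means a factor of 100).

```python
import math

# Hypothetical executable solution for the title question:
# "How much coffee was consumed during EMNLP 2019?"
# All supporting facts below are illustrative assumptions,
# NOT values taken from the actual dataset.

def estimate_coffee_emnlp_2019_liters() -> float:
    attendees = 2000              # assumed: rough EMNLP 2019 attendance
    cups_per_person_per_day = 2   # assumed: typical conference coffee habit
    conference_days = 5           # assumed: main conference plus workshops
    liters_per_cup = 0.24         # assumed: a standard ~240 ml cup
    return attendees * cups_per_person_per_day * conference_days * liters_per_cup

def order_of_magnitude_error(estimate: float, gold: float) -> float:
    """Absolute log10 ratio between estimate and gold answer.
    An error of 2.0 means the estimate is off by two orders of magnitude."""
    return abs(math.log10(estimate / gold))

if __name__ == "__main__":
    est = estimate_coffee_emnlp_2019_liters()
    print(f"Estimated coffee consumed: {est:.0f} liters")  # ~4800 liters
    # Scoring against a hypothetical gold answer of 5000 liters:
    print(f"Order-of-magnitude error: {order_of_magnitude_error(est, 5000):.2f}")
```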
Related papers
- FEABench: Evaluating Language Models on Multiphysics Reasoning Ability [8.441945838936444]
We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA).
We introduce a comprehensive evaluation scheme to investigate the ability of LLMs to solve these problems end-to-end by reasoning over natural language problem descriptions and operating COMSOL Multiphysics®, an FEA software package, to compute the answers.
arXiv Detail & Related papers (2025-04-08T17:59:39Z)
- Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics [13.530403536762064]
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology.
The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level.
We evaluate our data set on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen.
arXiv Detail & Related papers (2025-02-19T19:00:00Z)
- Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA [43.116608441891096]
Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning.
State-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval.
arXiv Detail & Related papers (2024-10-09T03:53:26Z)
- OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities.
These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage.
Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z)
- Cognition is All You Need -- The Next Layer of AI Above Large Language Models [0.0]
We present Cognitive AI, a framework for neurosymbolic cognition outside of large language models.
We propose that Cognitive AI is a necessary precursor for the evolution of the forms of AI, such as AGI, and specifically claim that AGI cannot be achieved by probabilistic approaches on their own.
We conclude with a discussion of the implications for large language models, adoption cycles in AI, and commercial Cognitive AI development.
arXiv Detail & Related papers (2024-03-04T16:11:57Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- Successive Prompting for Decomposing Complex Questions [50.00659445976735]
Recent works leverage the capabilities of large language models (LMs) to perform complex question answering in a few-shot setting.
We introduce "Successive Prompting", where we iteratively break down a complex task into a simple task, solve it, and then repeat the process until we get the final solution (see the sketch after this list).
Our best model (with successive prompting) achieves an improvement of 5% absolute F1 on a few-shot version of the DROP dataset.
arXiv Detail & Related papers (2022-12-08T06:03:38Z)
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge for modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
- Doubly-stochastic mining for heterogeneous retrieval [74.43785301907276]
Modern retrieval problems are characterised by training sets with potentially billions of labels.
With a large number of labels, standard losses are difficult to optimise even on a single example.
We propose doubly-stochastic mining (S2M) to address both challenges.
arXiv Detail & Related papers (2020-04-23T00:43:13Z)
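As a rough illustration of the decomposition loop described in the Successive Prompting entry above, a minimal control loop might look like the following sketch. The `ask_model` function, the prompt wording, and the FINAL stop marker are placeholders, not the paper's actual prompts or interface.

```python
# Minimal sketch of a successive-prompting loop, assuming a generic
# ask_model(prompt) -> str LLM call; the prompts and the FINAL stop
# marker are illustrative, not the paper's actual implementation.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")

def successive_prompting(question: str, max_steps: int = 10) -> str:
    context = f"Complex question: {question}\n"
    for _ in range(max_steps):
        # Ask the model for the next simple sub-question.
        sub_q = ask_model(context + "Next simple sub-question (or FINAL if done):")
        if sub_q.strip().startswith("FINAL"):
            break
        # Answer the sub-question and append it to the context,
        # so later steps can build on intermediate answers.
        sub_a = ask_model(context + f"Answer this sub-question: {sub_q}")
        context += f"Q: {sub_q}\nA: {sub_a}\n"
    # Finally, ask for the overall answer given all intermediate steps.
    return ask_model(context + "Final answer to the complex question:")
```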
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.