GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning
- URL: http://arxiv.org/abs/2407.01892v1
- Date: Tue, 2 Jul 2024 02:27:46 GMT
- Title: GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning
- Authors: Zhisheng Tang, Mayank Kejriwal
- Abstract summary: Spatial reasoning is one of the core commonsense skills that is not purely language-based and requires some minimum degree of planning.
Existing benchmarks of Commonsense Spatial Reasoning (CSR) tend to evaluate how Large Language Models (LLMs) interpret text-based spatial descriptions.
We construct a large-scale benchmark called $\textbf{GRASP}$, which consists of 16,000 grid-based environments where the agent is tasked with an energy collection problem.
- Score: 2.9312156642007294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial reasoning, an important faculty of human cognition with many practical applications, is one of the core commonsense skills that is not purely language-based and, for satisfying (as opposed to optimal) solutions, requires some minimum degree of planning. Existing benchmarks of Commonsense Spatial Reasoning (CSR) tend to evaluate how Large Language Models (LLMs) interpret text-based spatial descriptions rather than directly evaluate a plan produced by the LLM in response to a spatial reasoning scenario. In this paper, we construct a large-scale benchmark called $\textbf{GRASP}$, which consists of 16,000 grid-based environments where the agent is tasked with an energy collection problem. These environments include 100 grid instances instantiated using each of the 160 different grid settings, involving five different energy distributions, two modes of agent starting position, and two distinct obstacle configurations, as well as three kinds of agent constraints. Using GRASP, we compare classic baseline approaches, such as random walk and greedy search methods, with advanced LLMs like GPT-3.5-Turbo and GPT-4o. The experimental results indicate that even these advanced LLMs struggle to consistently achieve satisfactory solutions.
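The kind of comparison the abstract describes, a random walk baseline against a greedy search baseline on a grid with energy and obstacles, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the grid size, the uniform random placement in `make_grid`, the central start cell, the four-direction moves, the `max_steps` budget, and the scoring by number of energy cells collected. The sketch also ignores the agent constraints mentioned in the abstract.

```python
import random
from collections import deque

# Hypothetical move set: four axis-aligned steps (an assumption, not from the paper).
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def make_grid(size=5, n_energy=6, n_obstacles=3, seed=0):
    """Hypothetical generator: scatter energy and obstacles uniformly at random."""
    rng = random.Random(seed)
    start = (size // 2, size // 2)  # assumed start position: grid centre
    free = [(r, c) for r in range(size) for c in range(size) if (r, c) != start]
    rng.shuffle(free)
    energy = set(free[:n_energy])
    obstacles = set(free[n_energy:n_energy + n_obstacles])
    return size, start, energy, obstacles


def neighbors(pos, size, obstacles):
    """Yield in-bounds, obstacle-free cells adjacent to pos."""
    for dr, dc in MOVES:
        nxt = (pos[0] + dr, pos[1] + dc)
        if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in obstacles:
            yield nxt


def random_walk(size, start, energy, obstacles, max_steps, seed=0):
    """Baseline 1: move to a uniformly random legal neighbor at every step."""
    rng = random.Random(seed)
    pos, remaining, collected = start, set(energy), 0
    for _ in range(max_steps):
        options = list(neighbors(pos, size, obstacles))
        if not options:
            break
        pos = rng.choice(options)
        if pos in remaining:
            remaining.remove(pos)
            collected += 1
    return collected


def path_to_nearest(pos, size, remaining, obstacles):
    """Breadth-first search for the nearest remaining energy cell.

    Returns the list of cells to walk through (excluding pos), or None
    if no remaining energy cell is reachable.
    """
    parent = {pos: None}
    queue = deque([pos])
    while queue:
        cur = queue.popleft()
        if cur in remaining:
            path = []
            while cur != pos:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for nxt in neighbors(cur, size, obstacles):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None


def greedy(size, start, energy, obstacles, max_steps):
    """Baseline 2: repeatedly walk to the nearest reachable energy cell."""
    pos, remaining, collected, steps = start, set(energy), 0, 0
    while remaining and steps < max_steps:
        path = path_to_nearest(pos, size, remaining, obstacles)
        if path is None:
            break
        for cell in path:
            if steps >= max_steps:
                break
            pos = cell
            steps += 1
            if pos in remaining:
                remaining.remove(pos)
                collected += 1
    return collected


if __name__ == "__main__":
    grid = make_grid(seed=42)
    print("random walk:", random_walk(*grid, max_steps=20))
    print("greedy:     ", greedy(*grid, max_steps=20))
```

Scoring a plan by executing it in the environment, as both functions do here, reflects the abstract's emphasis on directly evaluating a produced plan; an LLM's output could in principle be scored the same way once parsed into a sequence of moves.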
Related papers
- SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization [70.11167263638562]
Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images.
We first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework.
arXiv Detail & Related papers (2024-10-28T18:10:26Z)
- Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators [0.0]
Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications.
We propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics.
Our framework achieved a 7.78% pass rate with the human discriminator and a 6.11% pass rate with the LLM-based discriminator.
arXiv Detail & Related papers (2024-10-19T17:27:38Z)
- Words as Beacons: Guiding RL Agents with High-Level Language Prompts [6.7236795813629]
Large Language Models (LLMs) as "teachers" guide the agent's learning process by decomposing complex tasks into subgoals.
LLMs can provide subgoals for the task defined for the environment, much as a human would.
It is possible to query the LLM only during the training phase, enabling agents to operate within the environment without any LLM intervention.
arXiv Detail & Related papers (2024-10-11T08:54:45Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving [76.5322280307861]
StrategyLLM allows LLMs to perform inductive reasoning, deriving general strategies from specific task instances, and deductive reasoning, applying these general strategies to particular task examples, for constructing generalizable and consistent few-shot prompts.
Experimental results demonstrate that StrategyLLM outperforms the competitive baseline CoT-SC that requires human-annotated solutions on 13 datasets across 4 challenging tasks without human involvement, including math reasoning (34.2% $\rightarrow$ 38.8%), commonsense reasoning (70.3% $\rightarrow$ 72.5%), algorithmic reasoning (73.7% $\rightarrow$ 85.0
arXiv Detail & Related papers (2023-11-15T09:18:09Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Integrating LLMs and Decision Transformers for Language Grounded Generative Quality-Diversity [0.0]
Quality-Diversity is a branch of optimization that is often applied to problems from the Reinforcement Learning and control domains.
We propose a Large Language Model to augment the repertoire with natural language descriptions of trajectories.
We also propose an LLM-based approach to evaluating the performance of such generative agents.
arXiv Detail & Related papers (2023-08-25T10:00:06Z)
- Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning [11.91425153754564]
We show that in environments with a highly multi-modal reward landscape, value decomposition and parameter sharing can be problematic and lead to undesired outcomes.
In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases.
We present practical suggestions on implementing multi-agent PG algorithms for either high rewards or diverse emergent behaviors.
arXiv Detail & Related papers (2022-06-15T13:03:05Z)
- Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning [14.663216851932646]
We show that language models tend to perform fairly well at single step inference tasks, but struggle to chain together multiple reasoning steps to solve more complex problems.
We propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules.
We show that a 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100%.
arXiv Detail & Related papers (2022-05-19T17:25:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.