Related papers: GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning

GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning

URL: http://arxiv.org/abs/2407.01892v1
Date: Tue, 2 Jul 2024 02:27:46 GMT
Title: GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning
Authors: Zhisheng Tang, Mayank Kejriwal,
Abstract summary: Spatial reasoning is one of the core commonsense skills that is not purely language-based and requires some minimum degree of planning. Existing benchmarks of Commonsense Spatial Reasoning (CSR) tend to evaluate how Large Language Models (LLMs) interpret text-based spatial descriptions. We construct a large-scale benchmark called $textbfGRASP$, which consists of 16,000 grid-based environments where the agent is tasked with an energy collection problem.
Score: 2.9312156642007294
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatial reasoning, an important faculty of human cognition with many practical applications, is one of the core commonsense skills that is not purely language-based and, for satisfying (as opposed to optimal) solutions, requires some minimum degree of planning. Existing benchmarks of Commonsense Spatial Reasoning (CSR) tend to evaluate how Large Language Models (LLMs) interpret text-based spatial descriptions rather than directly evaluate a plan produced by the LLM in response to a spatial reasoning scenario. In this paper, we construct a large-scale benchmark called $\textbf{GRASP}$, which consists of 16,000 grid-based environments where the agent is tasked with an energy collection problem. These environments include 100 grid instances instantiated using each of the 160 different grid settings, involving five different energy distributions, two modes of agent starting position, and two distinct obstacle configurations, as well as three kinds of agent constraints. Using GRASP, we compare classic baseline approaches, such as random walk and greedy search methods, with advanced LLMs like GPT-3.5-Turbo and GPT-4o. The experimental results indicate that even these advanced LLMs struggle to consistently achieve satisfactory solutions.

Related papers

SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model [61.97017867656831]
We introduce a new task, ie, geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. We construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods.
arXiv Detail & Related papers (2025-04-13T16:36:47Z)
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks [0.0]
Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation.
arXiv Detail & Related papers (2025-03-23T16:20:14Z)
Aligning Multimodal LLM with Human Preference: A Survey [62.89722942008262]
Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs) have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed.
arXiv Detail & Related papers (2025-03-18T17:59:56Z)
SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization [70.11167263638562]
Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. We first present a simple yet well-crafted framework named name, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework.
arXiv Detail & Related papers (2024-10-28T18:10:26Z)
Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators [0.0]
Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications. We propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics. Our framework achieved a $7.78%$ pass rate with the human discriminator and a $6.11%$ pass rate with the LLM-based discriminator.
arXiv Detail & Related papers (2024-10-19T17:27:38Z)
Words as Beacons: Guiding RL Agents with High-Level Language Prompts [6.7236795813629]
Large Language Models (LLMs) as "teachers" guide the agent's learning process by decomposing complex tasks into subgoals. LLMs can provide subgoals to accomplish the task defined for the environment in a similar fashion to how a human would do. It is possible to query the LLM only during the training phase, enabling agents to operate within the environment without any LLM intervention.
arXiv Detail & Related papers (2024-10-11T08:54:45Z)
On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability [59.72892401927283]
We evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints.
arXiv Detail & Related papers (2024-09-30T03:58:43Z)
CHESS: Contextual Harnessing for Efficient SQL Synthesis [1.9506402593665235]
We introduce CHESS, a framework for efficient and scalable text-to- queries. It comprises four specialized agents, each targeting one of the aforementioned challenges. Our framework offers features that adapt to various deployment constraints.
arXiv Detail & Related papers (2024-05-27T01:54:16Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL) GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models [14.108788704400643]
GroundCocoa is a lexically diverse benchmark connecting these reasoning skills to the real-world problem of flight booking. Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format. Results indicate a significant disparity in performance among current state-of-the-art LLMs with even the best performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.
arXiv Detail & Related papers (2024-04-05T17:36:26Z)
Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving [76.5322280307861]
StrategyLLM allows LLMs to perform inductive reasoning, deriving general strategies from specific task instances, and deductive reasoning, applying these general strategies to particular task examples, for constructing generalizable and consistent few-shot prompts. Experimental results demonstrate that StrategyLLM outperforms the competitive baseline CoT-SC that requires human-annotated solutions on 13 datasets across 4 challenging tasks without human involvement, including math reasoning (34.2% $rightarrow$ 38.8%), commonsense reasoning (70.3% $rightarrow$ 72.5%), algorithmic reasoning (73.7% $rightarrow$ 85.0
arXiv Detail & Related papers (2023-11-15T09:18:09Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
Integrating LLMs and Decision Transformers for Language Grounded Generative Quality-Diversity [0.0]
Quality-Diversity is a branch of optimization that is often applied to problems from the Reinforcement Learning and control domains. We propose a Large Language Model to augment the repertoire with natural language descriptions of trajectories. We also propose an LLM-based approach to evaluating the performance of such generative agents.
arXiv Detail & Related papers (2023-08-25T10:00:06Z)
Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning [11.91425153754564]
We show that in environments with a highly multi-modal reward landscape, value decomposition, and parameter sharing can be problematic and lead to undesired outcomes. In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases. We present practical suggestions on implementing multi-agent PG algorithms for either high rewards or diverse emergent behaviors.
arXiv Detail & Related papers (2022-06-15T13:03:05Z)
Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning [14.663216851932646]
We show that language models tend to perform fairly well at single step inference tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. We propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules. We show that a 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100%.
arXiv Detail & Related papers (2022-05-19T17:25:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.