Related papers: Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics

Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics

URL: http://arxiv.org/abs/2402.15654v1
Date: Sat, 24 Feb 2024 00:01:01 GMT
Title: Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics
Authors: Sadaf Ghaffari, Nikhil Krishnaswamy
Abstract summary: We construct a simple simulated environment and demonstrate examples of where, in a zero-shot setting, both text and multimodal LLMs display atomic world knowledge about various objects but fail to compose this knowledge in correct solutions for an object manipulation and placement task. We also use BLIP, a vision-language model trained with more sophisticated cross-modal attention, to identify cases relevant to object physical properties that that model fails to ground.
Score: 5.497036643694402
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we present an exploration of LLMs' abilities to problem solve with physical reasoning in situated environments. We construct a simple simulated environment and demonstrate examples of where, in a zero-shot setting, both text and multimodal LLMs display atomic world knowledge about various objects but fail to compose this knowledge in correct solutions for an object manipulation and placement task. We also use BLIP, a vision-language model trained with more sophisticated cross-modal attention, to identify cases relevant to object physical properties that that model fails to ground. Finally, we present a procedure for discovering the relevant properties of objects in the environment and propose a method to distill this knowledge back into the LLM.

Related papers

A Call for New Recipes to Enhance Spatial Reasoning in MLLMs [85.67171333213301]
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. Recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world.
arXiv Detail & Related papers (2025-04-21T11:48:39Z)
Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models [35.01842161084472]
We propose a new physical reasoning task and a dataset, dubbed TraySim. Our task involves predicting the dynamics of several objects on a tray that is given an external impact. We present LLMPhy, a zero-shot black-box optimization framework that leverages the physics knowledge and program synthesis abilities of LLMs. Our results show that the combination of the LLM and the physics engine leads to state-of-the-art zero-shot physical reasoning performance.
arXiv Detail & Related papers (2024-11-12T18:56:58Z)
Language Agents Meet Causality -- Bridging LLMs and Causal World Models [50.79984529172807]
We propose a framework that integrates causal representation learning with large language models. This framework learns a causal world model, with causal variables linked to natural language expressions. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities.
arXiv Detail & Related papers (2024-10-25T18:36:37Z)
Open-World Object Detection with Instance Representation Learning [1.8749305679160366]
We propose a method to train an object detector that can both detect novel objects and extract semantically rich features in open-world conditions. Our method learns a robust and generalizable feature space, outperforming other OWOD-based feature extraction methods.
arXiv Detail & Related papers (2024-09-24T13:13:34Z)
Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange [50.45953583802282]
We introduce a novel self-supervised learning (SSL) strategy for point cloud scene understanding. Our approach leverages both object patterns and contextual cues to produce robust features. Our experiments demonstrate the superiority of our method over existing SSL techniques.
arXiv Detail & Related papers (2024-04-11T06:39:53Z)
Cognitive Planning for Object Goal Navigation using Generative AI Models [0.979851640406258]
We present a novel framework for solving the object goal navigation problem that generates efficient exploration strategies. Our approach enables a robot to navigate unfamiliar environments by leveraging Large Language Models (LLMs) and Large Vision-Language Models (LVLMs)
arXiv Detail & Related papers (2024-03-30T10:54:59Z)
Object Detectors in the Open Environment: Challenges, Solutions, and Outlook [95.3317059617271]
The dynamic and intricate nature of the open environment poses novel and formidable challenges to object detectors. This paper aims to conduct a comprehensive review and analysis of object detectors in open environments. We propose a framework that includes four quadrants (i.e., out-of-domain, out-of-category, robust learning, and incremental learning) based on the dimensions of the data / target changes.
arXiv Detail & Related papers (2024-03-24T19:32:39Z)
Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs) We suggest investigating internal activations and quantifying LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z)
LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
From Understanding to Utilization: A Survey on Explainability for Large Language Models [27.295767173801426]
This survey underscores the imperative for increased explainability in Large Language Models (LLMs) Our focus is primarily on pre-trained Transformer-based LLMs, which pose distinctive interpretability challenges due to their scale and complexity. When considering the utilization of explainability, we explore several compelling methods that concentrate on model editing, control generation, and model enhancement.
arXiv Detail & Related papers (2024-01-23T16:09:53Z)
LLMs for Robotic Object Disambiguation [21.101902684740796]
Our study reveals the LLM's aptitude for solving complex decision making challenges. A pivotal focus of our research is the object disambiguation capability of LLMs. We have developed a few-shot prompt engineering system to improve the LLM's ability to pose disambiguating queries.
arXiv Detail & Related papers (2024-01-07T04:46:23Z)
Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study [10.95835611110119]
We introduce a novel task -- Minesweeper -- designed in a format unfamiliar to Large Language Models (LLMs) This task challenges LLMs to identify the locations of mines based on numerical clues provided by adjacent opened cells. Our experiments, including trials with the advanced GPT-4 model, indicate that while LLMs possess the foundational abilities required for this task, they struggle to integrate these into a coherent, multi-step logical reasoning process needed to solve Minesweeper.
arXiv Detail & Related papers (2023-11-13T15:11:26Z)
Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels. Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions. We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.