Related papers: STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

URL: http://arxiv.org/abs/2503.23765v3
Date: Mon, 21 Apr 2025 13:43:53 GMT
Title: STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
Authors: Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, Bo Zhao,
Abstract summary: Multimodal Large Language Models (MLLMs) are an end-to-end solution for Embodied AI and Autonomous Driving.<n>We introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding.<n>Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios.
Score: 15.419733591210514
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leading to uncertain prospects. To evaluate models' Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. The extensive experiments reveals that the state-of-the-art MLLMs still struggle in real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.

Related papers

REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories [19.741468026765062]
We introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for embodied spatial reasoning.<n> REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints.<n>Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans.
arXiv Detail & Related papers (2025-11-30T05:20:22Z)
Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture [16.15618237704827]
We present a systematic analysis of spatial understanding from both data and architectural perspectives.<n>From the data perspective, the performance of spatial understanding converges quickly as the training data increases.<n>From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than within the language model.
arXiv Detail & Related papers (2025-09-02T14:22:43Z)
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models [66.833085504228]
We introduce V4DLM, the first benchmark specifically designed to evaluate visual language models (VLMs)<n>Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs.<n>We identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models.
arXiv Detail & Related papers (2025-08-04T06:06:06Z)
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored.<n>This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z)
A Call for New Recipes to Enhance Spatial Reasoning in MLLMs [85.67171333213301]
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. Recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world.
arXiv Detail & Related papers (2025-04-21T11:48:39Z)
Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization [7.0683335354070085]
We analyze recent MLLMs that have been explicitly trained to include fine-grained spatial reasoning capabilities. We demonstrate that these models are performant in certain settings, making them well suited for zero-shot scenarios.
arXiv Detail & Related papers (2025-04-14T21:34:06Z)
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks.<n>It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity.<n>We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z)
Benchmarking Large and Small MLLMs [71.78055760441256]
Large multimodal language models (MLLMs) have achieved remarkable advancements in understanding and generating multimodal content. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. Small MLLMs, exemplified by the LLava-series models and Phi-3-Vision, offer promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios.
arXiv Detail & Related papers (2025-01-04T07:44:49Z)
MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image [16.040813949620958]
We introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis.<n>Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism.<n>This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks.
arXiv Detail & Related papers (2024-11-25T09:00:36Z)
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving [13.115027801151484]
We introduce DriveMLLM, a benchmark designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving. DriveMLLM includes 880 front-facing camera images and introduces both absolute and relative spatial reasoning tasks, accompanied by linguistically diverse natural language questions. We evaluate several state-of-the-art MLLMs on DriveMLLM, and our results reveal the limitations of current models in understanding complex spatial relationships in driving contexts.
arXiv Detail & Related papers (2024-11-20T08:14:01Z)
A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems. This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.<n>Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.<n>We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios. We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model [25.86351431223383]
The Multi-Modal Large Language Model (MLLM) is an extension of the Large Language Model (LLM) equipped with the capability to receive and infer multi-modal data. This paper proposes using more precise spatial position information between objects to guide MLLM in providing more accurate responses to user-related inquiries.
arXiv Detail & Related papers (2023-10-31T10:57:35Z)
Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs [79.72586714047199]
We propose an attention-based deep learning method to reconstruct full-body motion from six IMU sensors in real-time. Our method achieves new state-of-the-art results both quantitatively and qualitatively, while being simple to implement and smaller in size.
arXiv Detail & Related papers (2022-03-29T16:24:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.