Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning
- URL: http://arxiv.org/abs/2510.03441v1
- Date: Fri, 03 Oct 2025 19:04:15 GMT
- Title: Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning
- Authors: Chashi Mahiul Islam, Oteo Mamo, Samuel Jacob Chacko, Xiuwen Liu, Weikuan Yu
- Abstract summary: Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. We introduce SpatialViLT, an enhanced VLM that integrates spatial features like depth maps, 3D coordinates, and edge maps through a multi-task learning framework. We propose two variants: SpatialViLT and MaskedSpatialViLT, focusing on full and masked object regions, respectively. Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset.
- Score: 1.5604334108839177
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates spatial features like depth maps, 3D coordinates, and edge maps through a multi-task learning framework. This approach enriches multimodal embeddings with spatial understanding. We propose two variants: SpatialViLT and MaskedSpatialViLT, focusing on full and masked object regions, respectively. Additionally, SpatialEnsemble combines both approaches, achieving state-of-the-art accuracy. Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset. This work represents a significant step in enhancing the spatial intelligence of AI systems, crucial for advanced multimodal understanding and real-world applications.
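To make the described setup more concrete, below is a minimal, hypothetical PyTorch sketch of the multi-task idea from the abstract: a shared multimodal embedding feeding auxiliary heads that predict depth, 3D coordinates, and edge maps alongside the main spatial-relation decision. The class names, head shapes, and loss weights are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only (PyTorch): a ViLT-style pooled embedding feeding
# auxiliary spatial heads (depth, 3D coordinates, edge maps) alongside the main
# spatial-relation objective. Shapes, names, and loss weights are assumptions;
# the paper's actual architecture and hyperparameters may differ.
import torch
import torch.nn as nn

class SpatialMultiTaskHead(nn.Module):
    """Auxiliary decoders mapping a fused multimodal embedding to spatial targets."""
    def __init__(self, hidden_dim: int, patch_grid: int = 14):
        super().__init__()
        n_patches = patch_grid * patch_grid
        self.depth_head = nn.Linear(hidden_dim, n_patches)      # per-patch depth
        self.coord_head = nn.Linear(hidden_dim, n_patches * 3)  # per-patch (x, y, z)
        self.edge_head = nn.Linear(hidden_dim, n_patches)       # per-patch edge score
        self.relation_head = nn.Linear(hidden_dim, 2)           # VSR: relation true/false

    def forward(self, pooled: torch.Tensor) -> dict:
        return {
            "depth": self.depth_head(pooled),
            "coords": self.coord_head(pooled),
            "edges": self.edge_head(pooled),
            "relation": self.relation_head(pooled),
        }

def multi_task_loss(outputs, targets, weights=(1.0, 0.5, 0.5, 0.5)):
    """Weighted sum of the main classification loss and auxiliary regression losses."""
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    return (weights[0] * ce(outputs["relation"], targets["relation"])
            + weights[1] * mse(outputs["depth"], targets["depth"])
            + weights[2] * mse(outputs["coords"], targets["coords"])
            + weights[3] * mse(outputs["edges"], targets["edges"]))
```

In this sketch the auxiliary losses act as regularizers that push spatial structure into the shared embedding; the paper's actual decoders, supervision targets, and weighting scheme may differ.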
Related papers
- MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation [11.01583588981339]
We present a new inference-time computing technique for on-device embodied AI, namely MosaicThinker. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation, a global semantic map, and further guide the VLM's spatial reasoning over the semantic map via a visual prompt. Experimental results show that our technique can greatly enhance the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, over reasoning tasks with diverse types and complexities.
arXiv Detail & Related papers (2026-02-06T06:17:29Z)
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- SpatialMosaic: A Multiview VLM Dataset for Partial Visibility [25.874299974251965]
We propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs. We introduce SpatialMosaic-Bench, a benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios. We also present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within Vision-Language Models.
arXiv Detail & Related papers (2025-12-29T10:48:54Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards [37.39035418889281]
We introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards.
arXiv Detail & Related papers (2025-11-10T18:52:47Z)
- Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models [75.45940282834327]
We introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy, resulting in significant improvements across multiple tasks.
arXiv Detail & Related papers (2025-11-03T14:27:00Z)
- How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs). We categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z)
- See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model [33.18304419115947]
We introduce SEE&TREK, the first training-free prompting framework to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. We focus on increasing visual diversity and motion reconstruction. Our method is training- and GPU-free, requiring only a single forward pass, and can be seamlessly integrated into existing MLLMs.
arXiv Detail & Related papers (2025-09-19T15:30:26Z)
- Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment [2.9493863710375674]
VEME is a novel method for achieving human-like reasoning in deep learning models for complex tasks in unknown environments. Our framework integrates three key components: (1) a cross-language alignment framework bridging objects, spatial representations, and visual semantics with spatio-temporal cues; (2) a dynamic, implicit cognitive world embedding activated to enable task-relevant memory recall; and (3) instruction-based navigation and reasoning for long-term planning and efficient exploration.
arXiv Detail & Related papers (2025-08-29T19:47:25Z)
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
- Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z)
- Elite360M: Efficient 360 Multi-task Learning via Bi-projection Fusion and Cross-task Collaboration [5.577935944665]
360 cameras capture the entire surrounding environment with a large FoV, providing comprehensive visual information from which to directly infer 3D structures.
Existing works predominantly specialize in a single task, leaving multi-task learning of 3D geometry and semantics largely unexplored.
We propose a novel end-to-end multi-task learning framework, named Elite360M, capable of inferring 3D structures via depth and surface normal estimation, and semantics via semantic segmentation simultaneously.
arXiv Detail & Related papers (2024-08-18T02:33:45Z)
- SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors [42.85605789984155]
Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA).
We present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner.
Our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and can extend to help in various downstream robotics tasks such as pick-and-stack and trajectory planning (see the illustrative sketch after this entry).
arXiv Detail & Related papers (2024-03-18T17:38:29Z)
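As a rough illustration of the training-free prompting idea behind SpatialPIN (and SEE&TREK above), the sketch below turns outputs of off-the-shelf 3D priors, here assumed to be precomputed per-object depths and detection boxes, into textual spatial facts prepended to the question for an unmodified VLM. All names, fields, and thresholds are hypothetical, not the papers' actual interfaces.

```python
# Hypothetical sketch of training-free spatial prompting: spatial facts derived
# from off-the-shelf 3D priors (precomputed per-object depths and boxes) are
# rendered as text and prepended to the question before querying any VLM.
from dataclasses import dataclass

@dataclass
class DetectedObject:
    name: str
    box: tuple           # (x0, y0, x1, y1) in pixels, from any detector
    median_depth: float   # metres, from any monocular depth estimator

def spatial_prompt(objects: list[DetectedObject], question: str) -> str:
    """Turn geometric priors into natural-language facts that guide the VLM."""
    facts = []
    for a in objects:
        for b in objects:
            if a is b:
                continue
            if a.median_depth + 0.2 < b.median_depth:   # a is clearly nearer
                facts.append(f"The {a.name} is closer to the camera than the {b.name}.")
            if a.box[2] < b.box[0]:                     # a lies entirely left of b
                facts.append(f"The {a.name} is to the left of the {b.name}.")
    return "Scene facts: " + " ".join(facts) + f"\nQuestion: {question}"

# Example: the enriched prompt would then be passed to any VLM's text interface.
print(spatial_prompt(
    [DetectedObject("mug", (40, 120, 180, 300), 0.8),
     DetectedObject("laptop", (320, 90, 600, 380), 1.5)],
    "Is the mug in front of the laptop?"))
```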
This list is automatically generated from the titles and abstracts of the papers on this site.