Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
- URL: http://arxiv.org/abs/2505.12363v3
- Date: Wed, 28 May 2025 08:29:25 GMT
- Title: Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
- Authors: Qi Feng
- Abstract summary: We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale visuospatial cognition dataset with over 322,000 spatially grounded question-answer pairs.
- Score: 4.454997649515497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.
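The abstract describes the architecture only at a high level: two vision encoders (SigLIP for semantics, Hiera for spatial structure) whose token streams are merged under a token ratio control mechanism before reaching the language model. The snippet below is a minimal PyTorch sketch of that general idea, assuming a simple projection-and-concatenation fusion with pooling-based ratio control; all module names, dimensions, and the ratio value are illustrative and are not taken from the ViCA2 codebase.

```python
import torch
import torch.nn as nn


class DualEncoderFusion(nn.Module):
    """Illustrative fusion of two vision token streams with a token-ratio control.

    `semantic_dim` / `spatial_dim` stand in for SigLIP / Hiera feature widths;
    the actual ViCA2 fusion module may differ substantially.
    """

    def __init__(self, semantic_dim=1152, spatial_dim=768, llm_dim=4096,
                 spatial_token_ratio=0.25):
        super().__init__()
        self.spatial_token_ratio = spatial_token_ratio  # fraction of spatial tokens kept
        self.semantic_proj = nn.Linear(semantic_dim, llm_dim)
        self.spatial_proj = nn.Linear(spatial_dim, llm_dim)

    def forward(self, semantic_tokens, spatial_tokens):
        # semantic_tokens: [B, N_sem, semantic_dim]  (e.g. SigLIP patch features)
        # spatial_tokens:  [B, N_spa, spatial_dim]   (e.g. Hiera structural features)
        b, n_spa, d = spatial_tokens.shape
        keep = max(1, int(n_spa * self.spatial_token_ratio))

        # Token-ratio control: average-pool the spatial stream down to `keep` tokens
        # so the structural cue does not dominate the sequence budget.
        pooled = nn.functional.adaptive_avg_pool1d(
            spatial_tokens.transpose(1, 2), keep).transpose(1, 2)

        fused = torch.cat([self.semantic_proj(semantic_tokens),
                           self.spatial_proj(pooled)], dim=1)
        return fused  # [B, N_sem + keep, llm_dim], fed to the language model


if __name__ == "__main__":
    fusion = DualEncoderFusion()
    sem = torch.randn(2, 576, 1152)   # dummy SigLIP-like tokens
    spa = torch.randn(2, 1024, 768)   # dummy Hiera-like tokens
    print(fusion(sem, spa).shape)     # torch.Size([2, 832, 4096])
```

Pooling only the spatial stream is one plausible way to keep the total token count within the language model's budget; the released codebase should be consulted for the actual mechanism.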
Related papers
- CoCAViT: Compact Vision Transformer with Robust Global Coordination [8.041959685852085]
We present CoCAViT, a novel visual backbone designed for robust real-time visual representation. At a resolution of 224×224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains on multiple OOD benchmarks. It also attains 52.2 mAP on object detection and 51.3 mIoU on ADE20K semantic segmentation, while maintaining low latency.
arXiv Detail & Related papers (2025-08-07T12:07:12Z)
- Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model [0.5465345065283892]
We demonstrate the impressive performance of LLaVA-NeXT-Interleave on 22 datasets across three different tasks. We add the Dense Channel Integration (DCI) connector to LLaVA-NeXT-Interleave and compare its performance against the standard model.
arXiv Detail & Related papers (2025-06-13T12:48:39Z)
- Visuospatial Cognitive Assistant [6.963160586041051]
Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). We introduce ViCA-322K (Visuospatial Cognitive Assistant), a dataset of 322,003 question-answer pairs from real-world indoor videos. For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking.
arXiv Detail & Related papers (2025-05-18T08:55:02Z)
- Performance Analysis of Traditional VQA Models Under Limited Computational Resources [0.0]
This paper investigates the performance of traditional VQA models under computational constraints. We evaluate models based on Bidirectional GRU (BidGRU), GRU, Bidirectional LSTM (BidLSTM), and Convolutional Neural Networks (CNNs). Experimental results show that the BidGRU model with an embedding dimension of 300 and a vocabulary size of 3000 achieves the best overall performance without the computational overhead of larger models.
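Since the entry pins down concrete settings (a bidirectional GRU question encoder, embedding dimension 300, vocabulary size 3000), a toy PyTorch sketch of such a baseline follows; the image-feature dimension, the elementwise-product fusion, and the answer-vocabulary size are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class BidGRUVQA(nn.Module):
    """Toy VQA baseline: a bidirectional GRU question encoder fused with
    precomputed image features. Vocab size and embedding dim follow the
    entry above; everything else is illustrative."""

    def __init__(self, vocab_size=3000, embed_dim=300, hidden=512,
                 img_feat_dim=2048, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.q_proj = nn.Linear(2 * hidden, hidden)     # concat of both directions
        self.v_proj = nn.Linear(img_feat_dim, hidden)   # precomputed CNN image features
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, question_ids, image_feats):
        # question_ids: [B, T] token ids, image_feats: [B, img_feat_dim]
        _, h_n = self.gru(self.embed(question_ids))     # h_n: [2, B, hidden]
        q = torch.cat([h_n[0], h_n[1]], dim=-1)         # [B, 2*hidden]
        joint = torch.tanh(self.q_proj(q)) * torch.tanh(self.v_proj(image_feats))
        return self.classifier(joint)                   # answer logits


model = BidGRUVQA()
logits = model(torch.randint(1, 3000, (4, 14)), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 1000])
```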
arXiv Detail & Related papers (2025-02-09T01:47:59Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- Hymba: A Hybrid-head Architecture for Small Language Models [65.94140459055244]
Hymba is a family of small language models featuring a hybrid-head parallel architecture.
We introduce learnable meta tokens that are prepended to prompts, storing critical information.
This model is further optimized by incorporating cross-layer key-value sharing and partial sliding window attention.
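The meta-token idea can be illustrated in isolation. The following minimal PyTorch sketch only shows learnable tokens being prepended to prompt embeddings; Hymba's hybrid attention/SSM heads, key-value sharing, and sliding-window attention are not modeled, and the token count and width are arbitrary.

```python
import torch
import torch.nn as nn


class MetaTokenPrepend(nn.Module):
    """Prepend a small set of learnable 'meta' embeddings to every prompt.

    Illustrates only the meta-token idea from the summary above; the rest
    of the Hymba architecture is not modeled here."""

    def __init__(self, num_meta_tokens=16, d_model=768):
        super().__init__()
        # Trained jointly with the model; acts as a persistent scratchpad
        # available to every layer regardless of the user prompt.
        self.meta_tokens = nn.Parameter(torch.randn(num_meta_tokens, d_model) * 0.02)

    def forward(self, prompt_embeds):
        # prompt_embeds: [B, T, d_model] token embeddings of the user prompt
        b = prompt_embeds.size(0)
        meta = self.meta_tokens.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([meta, prompt_embeds], dim=1)  # [B, num_meta + T, d_model]


prepend = MetaTokenPrepend()
out = prepend(torch.randn(2, 10, 768))
print(out.shape)  # torch.Size([2, 26, 768])
```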
arXiv Detail & Related papers (2024-11-20T19:51:25Z)
- Computer Vision Model Compression Techniques for Embedded Systems: A Survey [75.38606213726906]
This paper covers the main model compression techniques applied for computer vision tasks.
We present the characteristics of compression subareas, compare different approaches, and discuss how to choose the best technique.
We also share code to assist researchers and new practitioners in overcoming initial implementation challenges.
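As a concrete taste of two techniques such a survey covers, the sketch below applies magnitude pruning and dynamic int8 quantization to a toy classifier head using standard PyTorch utilities; it is a generic example, not code from the survey.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for a vision backbone's classifier head.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) Magnitude pruning: zero out the 50% smallest-magnitude weights per Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the sparsity permanent

# 2) Dynamic quantization: store Linear weights in int8, dequantize on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10])
```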
arXiv Detail & Related papers (2024-08-15T16:41:55Z)
- REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
- Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model [26.786890883280062]
State Space Models (SSMs) have garnered widespread attention due to their global receptive field and linear complexity.
To improve the performance of SSMs in vision tasks, a multi-scan strategy is widely adopted.
We introduce Multi-Scale Vision Mamba (MSVMamba) to preserve the superiority of SSMs in vision tasks with limited parameters.
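The multi-scan strategy itself is easy to illustrate: a 2D feature map is flattened into several 1D token orders so a sequential state space model can traverse it from different directions. The sketch below shows the common four-direction variant and omits the SSM; it is a generic illustration, not MSVMamba's implementation.

```python
import torch


def four_direction_scans(feat):
    """Flatten a [B, C, H, W] feature map into four 1-D token orders, as used
    by multi-scan vision SSMs. Returns a tensor of shape [B, 4, C, H*W]."""
    row = feat.flatten(2)                      # row-major scan
    row_rev = row.flip(-1)                     # reversed row-major
    col = feat.transpose(2, 3).flatten(2)      # column-major scan
    col_rev = col.flip(-1)                     # reversed column-major
    return torch.stack([row, row_rev, col, col_rev], dim=1)


scans = four_direction_scans(torch.randn(2, 96, 14, 14))
print(scans.shape)  # torch.Size([2, 4, 96, 196])
```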
arXiv Detail & Related papers (2024-05-23T04:59:49Z)
- GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models [17.488420164181463]
This paper introduces a sophisticated encoder-decoder framework to address visual grounding in autonomous vehicles (AVs). Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders, including Text, Image, Context, and Cross-Modal, with a Multimodal decoder.
arXiv Detail & Related papers (2023-12-06T15:14:30Z)
- General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation [35.100738362291416]
Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation.
Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance.
We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
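One way to realize such a module is to stack co-registered modalities along the depth axis of a 3D convolution, so local spatial mixing and cross-modal mixing happen in a single operation. The sketch below shows that idea only; channel counts, normalization, and the surrounding UNet structure are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CrossModal3DConvBlock(nn.Module):
    """Stack M co-registered modalities along the depth axis and mix them with
    a 3D convolution, so spatial context and cross-modal interactions are
    learned jointly. Channel counts are illustrative, not the paper's."""

    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, modalities):
        # modalities: list of M tensors, each [B, C, H, W]
        x = torch.stack(modalities, dim=2)   # [B, C, M, H, W]; M is the depth axis
        return self.block(x)                 # [B, out_channels, M, H, W]


rgb = torch.randn(2, 3, 128, 128)   # dummy optical input
aux = torch.randn(2, 3, 128, 128)   # dummy second modality (same channel count here)
print(CrossModal3DConvBlock()([rgb, aux]).shape)  # torch.Size([2, 64, 2, 128, 128])
```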
arXiv Detail & Related papers (2023-07-07T04:58:34Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
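For readers unfamiliar with sparse MoE layers, the sketch below shows a generic top-k gated mixture-of-experts feed-forward block in PyTorch; the expert count, routing rule, and absence of load-balancing losses are simplifications, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Generic sparsely-gated MoE feed-forward layer with top-k routing.

    Illustrates the MoE idea only; the paper's routing, balancing losses,
    and expert placement are not reproduced here."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts))

    def forward(self, x):
        # x: [num_tokens, d_model]; each token is routed to its top-k experts.
        scores = self.gate(x)                              # [T, E]
        topk_val, topk_idx = scores.topk(self.k, dim=-1)   # [T, k]
        weights = F.softmax(topk_val, dim=-1)              # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


moe = TopKMoE()
tokens = torch.randn(32, 512)
print(moe(tokens).shape)  # torch.Size([32, 512])
```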
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
- Revealing the Invisible with Model and Data Shrinking for Composite-database Micro-expression Recognition [49.463864096615254]
We analyze the influence of learning complexity, including the input complexity and model complexity.
We propose a recurrent convolutional network (RCN) to explore shallower architectures and lower-resolution input data.
We develop three parameter-free modules to integrate with RCN without increasing any learnable parameters.
arXiv Detail & Related papers (2020-06-17T06:19:24Z)