Visuospatial Cognitive Assistant
- URL: http://arxiv.org/abs/2505.12312v1
- Date: Sun, 18 May 2025 08:55:02 GMT
- Title: Visuospatial Cognitive Assistant
- Authors: Qi Feng, Hidetoshi Shimodaira
- Abstract summary: Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). We introduce ViCA (Visuospatial Cognitive Assistant)-322K, a dataset of 322,003 QA pairs from real-world indoor videos. For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking.
- Score: 6.963160586041051
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.
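As a rough illustration of the kind of supervision the abstract describes, the sketch below shows how a single video-grounded QA pair could be represented and formatted for instruction tuning. The field names, the `VideoQAPair` record, and the prompt template are assumptions for illustration; ViCA-322K's actual schema is not specified here.

```python
# Hedged sketch: one way a ViCA-322K-style video QA record might look and how
# it could be turned into an instruction-tuning prompt. Field names
# ("video", "question", "answer") and the template are assumptions, not the
# dataset's documented schema.
import json
from dataclasses import dataclass
from typing import List

@dataclass
class VideoQAPair:
    video_path: str  # clip from an indoor scan (e.g., ScanNet / ARKitScenes)
    question: str    # spatial query, e.g., "How far apart are the sofa and the TV?"
    answer: str      # ground truth, possibly derived from 3D metadata

def load_pairs(jsonl_path: str) -> List[VideoQAPair]:
    """Read one JSON object per line into VideoQAPair records."""
    with open(jsonl_path) as f:
        return [VideoQAPair(r["video"], r["question"], r["answer"])
                for r in map(json.loads, f)]

def to_prompt(pair: VideoQAPair) -> str:
    """Format one pair as a single supervised fine-tuning string."""
    return (f"<video>{pair.video_path}</video>\n"
            f"Question: {pair.question}\nAnswer: {pair.answer}")

# Example usage with an in-memory record (no file needed):
example = VideoQAPair("scannet/scene0000_00.mp4",
                      "How far apart are the sofa and the TV?", "2.3 meters")
print(to_prompt(example))
```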
Related papers
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning [43.18609951839598]
A major challenge for modern AI is to learn to understand the world and act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data. We develop models capable of understanding, predicting, and planning in the physical world.
arXiv Detail & Related papers (2025-06-11T17:57:09Z) - 3D Question Answering via only 2D Vision-Language Models [87.41421075243103]
Large vision-language models (LVLMs) have advanced numerous fields. We explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA.
arXiv Detail & Related papers (2025-05-28T09:04:39Z) - Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts [6.963160586041051]
We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency (see the sketch after this list). We also developed ViCA-322K, a new large-scale cognition dataset with over 322,000 spatially grounded question-answer pairs.
arXiv Detail & Related papers (2025-05-18T10:57:33Z) - RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments [62.5830455357187]
We set up an egocentric multi-sensor data collection platform based on three main types of sensors (camera, LiDAR, and fisheye). A large-scale multimodal dataset, named RoboSense, is constructed to facilitate egocentric robot perception.
arXiv Detail & Related papers (2024-08-28T03:17:40Z) - Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models [0.09264362806173355]
Large Language and Vision Assistant models (LLVAs) engage users in rich conversational experiences intertwined with image-based queries.
This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts.
Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images.
arXiv Detail & Related papers (2023-12-30T03:19:54Z) - ViA: View-invariant Skeleton Action Representation Learning via Motion Retargeting [10.811088895926776]
ViA is a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning.
We conduct a study focusing on transfer-learning for skeleton-based action recognition with self-supervised pre-training on real-world data.
Our results showcase that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy.
arXiv Detail & Related papers (2022-08-31T18:49:38Z) - Florence: A New Foundation Model for Computer Vision [97.26333007250142]
We introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object).
By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks.
Florence achieves new state-of-the-art results in the majority of 44 representative benchmarks.
arXiv Detail & Related papers (2021-11-22T18:59:55Z) - KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D [67.50776195828242]
KITTI-360 is a suburban driving dataset which comprises richer input modalities, comprehensive semantic instance annotations and accurate localization.
For efficient annotation, we created a tool to label 3D scenes with bounding primitives, resulting in over 150k semantic and instance annotated images and 1B annotated 3D points.
We established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics on the same dataset.
arXiv Detail & Related papers (2021-09-28T00:41:29Z) - 3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding [33.68455617113953]
We present a 3D AffordanceNet dataset, a benchmark of 23k shapes from 23 semantic object categories, annotated with 18 visual affordance categories.
Three state-of-the-art point cloud deep learning networks are evaluated on all tasks.
arXiv Detail & Related papers (2021-03-30T14:46:27Z) - Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z) - CRAVES: Controlling Robotic Arm with a Vision-based Economic System [96.56564257199474]
Training a robotic arm to accomplish real-world tasks has been attracting increasing attention in both academia and industry. This work discusses the role of computer vision algorithms in this field. We present an alternative solution, which uses a 3D model to create a large amount of synthetic data.
arXiv Detail & Related papers (2018-12-03T13:28:29Z)
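The ViCA2 entry above describes a dual vision architecture with a token ratio control mechanism. The sketch below illustrates the general token-budgeting idea only: two encoder streams are pooled and concatenated under a fixed total budget. The pooling strategy, budget of 576 tokens, ratio of 0.25, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a token-ratio control step for fusing two visual streams
# (e.g., a semantic SigLIP-like stream and a spatial Hiera-like stream).
# All numbers here are assumptions chosen for illustration.
import numpy as np

def pool_tokens(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Average-pool a (N, D) token sequence down to (keep, D)."""
    n, _ = tokens.shape
    assert n >= keep, "cannot pool to more tokens than are available"
    # Split indices into `keep` roughly equal chunks and average each chunk.
    chunks = np.array_split(np.arange(n), keep)
    return np.stack([tokens[idx].mean(axis=0) for idx in chunks])

def fuse_streams(semantic: np.ndarray, spatial: np.ndarray,
                 budget: int = 576, spatial_ratio: float = 0.25) -> np.ndarray:
    """Concatenate pooled tokens from both streams under a total token budget."""
    n_spatial = int(budget * spatial_ratio)
    n_semantic = budget - n_spatial
    return np.concatenate([pool_tokens(semantic, n_semantic),
                           pool_tokens(spatial, n_spatial)], axis=0)

# Example: a 1024-token semantic stream and a 4096-token spatial stream,
# both with 768-dimensional features.
fused = fuse_streams(np.random.randn(1024, 768), np.random.randn(4096, 768))
print(fused.shape)  # (576, 768)
```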
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.