Solving Spatial Supersensing Without Spatial Supersensing
- URL: http://arxiv.org/abs/2511.16655v1
- Date: Thu, 20 Nov 2025 18:57:05 GMT
- Title: Solving Spatial Supersensing Without Spatial Supersensing
- Authors: Vishaal Udandarao, Shyamgopal Karthik, Surabhi S. Nath, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu,
- Abstract summary: Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing. In this work, we conduct a critical analysis of Cambrian-S across two benchmarks. We show that benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing.
- Score: 31.7966908405844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows that benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: we concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark: rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) the predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than through robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
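The VSC-Repeat sanity check above can be illustrated with a minimal sketch. The code below is a hypothetical toy, not the authors' implementation: `count_unique_objects` stands in for any video object counter (a real system would run a vision model over frames), and the check asserts that self-concatenating a video 1-5 times leaves the unique-object count invariant.

```python
# Hypothetical sketch of the VSC-Repeat sanity check: concatenating a
# video with itself adds no new objects, so a counting model's
# prediction should not change under this perturbation.

def count_unique_objects(frames):
    # Toy counter: frames are represented by object labels, and we
    # count distinct labels. A real system would detect and re-identify
    # objects across frames with a vision model.
    return len(set(frames))

def vsc_repeat(frames, k):
    """Concatenate the video with itself k extra times (k in 1..5)."""
    return frames * (k + 1)

video = ["chair", "table", "chair", "lamp"]
baseline = count_unique_objects(video)  # 3 unique objects
for k in range(1, 6):
    # An invariance a spatially-aware counter should satisfy; the paper
    # reports that Cambrian-S's inference strategy fails it.
    assert count_unique_objects(vsc_repeat(video, k)) == baseline
```

The point of the check is that any model genuinely tracking unique objects across experiences passes it trivially, while a model exploiting the "rooms are never revisited" shortcut does not.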
Related papers
- MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence [61.065486539729875]
MMSI-Video-Bench is a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human-AI gap.
arXiv Detail & Related papers (2025-12-11T17:57:24Z) - SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation [38.85929062825556]
We propose a superquadric-based occupancy model to enable real-time inference. On the Occ3D dataset, SuperQuadricOcc achieves a 75% reduction in memory footprint and a 5.9% improvement in mIoU. To our knowledge, this is the first occupancy model to enable real-time inference while maintaining competitive performance.
arXiv Detail & Related papers (2025-11-21T16:26:31Z) - Cambrian-S: Towards Spatial Supersensing in Video [78.46305169769884]
We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving a +30% absolute improvement without sacrificing general capabilities. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised…
arXiv Detail & Related papers (2025-11-06T18:55:17Z) - Scalable and adaptive prediction bands with kernel sum-of-squares [0.5530212768657544]
Conformal Prediction (CP) is a popular framework for constructing prediction bands with valid coverage in finite samples. We build upon recent ideas that rely on recasting the CP problem as a statistical learning problem, directly targeting coverage and adaptivity.
arXiv Detail & Related papers (2025-05-27T11:21:17Z) - AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders [73.37603699731329]
We introduce AxBench, a large-scale benchmark for steering and concept detection. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods, such as difference-in-means, perform best.
arXiv Detail & Related papers (2025-01-28T18:51:24Z) - SC^2-PCR: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud Registration [32.87420625579577]
We propose a second order spatial compatibility (SC2) measure to compute the similarity between correspondences.
Based on this measure, our registration pipeline employs a global spectral technique to find some reliable seeds from the initial correspondences.
Our method is guaranteed to find a certain number of outlier-free consensus sets using fewer samplings.
arXiv Detail & Related papers (2022-03-28T02:41:28Z) - VSAC: Efficient and Accurate Estimator for H and F [68.65610177368617]
VSAC is a RANSAC-type robust estimator with a number of novelties.
It is significantly faster than all its predecessors and runs in 1-2 ms on average on a CPU.
It is two orders of magnitude faster and yet as precise as MAGSAC++, the currently most accurate estimator of two-view geometry.
arXiv Detail & Related papers (2021-06-18T17:04:57Z) - Learning to Estimate Hidden Motions with Global Motion Aggregation [71.12650817490318]
Occlusions pose a significant challenge to optical flow algorithms that rely on local evidence.
We introduce a global motion aggregation module to find long-range dependencies between pixels in the first image.
We demonstrate that the optical flow estimates in the occluded regions can be significantly improved without damaging the performance in non-occluded regions.
arXiv Detail & Related papers (2021-04-06T10:32:03Z) - A Comprehensive Comparison of End-to-End Approaches for Handwritten Digit String Recognition [21.522563264752577]
We evaluate different end-to-end approaches to solving the HDSR problem, in two verticals: those based on object detection and those based on sequence-to-sequence representations.
Our results show that the YOLO model compares favorably against segmentation-free models, with the advantage of having a shorter pipeline.
arXiv Detail & Related papers (2020-10-29T19:38:08Z) - 1st Place Solutions for OpenImage2019 -- Object Detection and Instance Segmentation [116.25081559037872]
This article introduces the solutions of the two champion teams, MMfruit for the detection track and MMfruitSeg for the segmentation track, in OpenImage Challenge 2019.
It is commonly known that for an object detector, the shared feature at the end of the backbone is not appropriate for both classification and regression.
We propose the Decoupling Head (DH) to disentangle the object classification and regression via the self-learned optimal feature extraction.
arXiv Detail & Related papers (2020-03-17T06:45:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.