SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation
- URL: http://arxiv.org/abs/2601.17657v1
- Date: Sun, 25 Jan 2026 02:32:01 GMT
- Title: SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation
- Authors: Taewan Cho, Taeryang Kim, Andrew Jaeyong Choi
- Abstract summary: We present SPACE-CLIP, an architecture that unlocks and interprets latent geometric knowledge directly from a frozen CLIP vision encoder. A semantic pathway interprets high-level features, dynamically conditioned on global context. A structural pathway extracts fine-grained spatial details from early layers.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has achieved extraordinary success in semantic understanding but inherently struggles to perceive geometric structure. Existing methods attempt to bridge this gap by querying CLIP with textual prompts, a process that is often indirect and inefficient. This paper introduces a fundamentally different approach using a dual-pathway decoder. We present SPACE-CLIP, an architecture that unlocks and interprets latent geometric knowledge directly from a frozen CLIP vision encoder, completely bypassing the text encoder and its associated textual prompts. A semantic pathway interprets high-level features, dynamically conditioned on global context using feature-wise linear modulation (FiLM). In addition, a structural pathway extracts fine-grained spatial details from early layers. These complementary streams are hierarchically fused, enabling a robust synthesis of semantic context and precise geometry. Extensive experiments on the KITTI benchmark show that SPACE-CLIP dramatically outperforms previous CLIP-based methods. Our ablation studies validate that the synergistic fusion of our dual pathways is critical to this success. SPACE-CLIP offers a new, efficient, and architecturally elegant blueprint for repurposing large-scale vision models. The proposed method is not just a standalone depth estimator, but a readily integrable spatial perception module for the next generation of embodied AI systems, such as vision-language-action (VLA) models. Our model is available at https://github.com/taewan2002/space-clip
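The abstract mentions feature-wise linear modulation (FiLM) for conditioning the semantic pathway on global context. As a rough illustration of the general FiLM idea only (a per-channel scale and shift predicted from a conditioning vector), here is a minimal pure-Python sketch. The helper names, toy sizes, and weights are all hypothetical, and the listing does not say how SPACE-CLIP actually computes or applies these parameters.

```python
from typing import List

Vector = List[float]


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: List[Vector], gamma: Vector, beta: Vector) -> List[Vector]:
    """Apply FiLM: scale and shift every channel of each patch feature."""
    return [[g * f + b for f, g, b in zip(token, gamma, beta)] for token in features]


# Hypothetical toy sizes: 2 patch tokens with 3 channels, 2-dim global context.
patch_features = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
global_context = [1.0, -1.0]

# Hypothetical projections that turn the global context into gamma and beta.
w_gamma = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [1.0, 1.0, 1.0]
w_beta = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]]
b_beta = [0.0, 0.0, 0.1]

gamma = linear(w_gamma, b_gamma, global_context)  # per-channel scale
beta = linear(w_beta, b_beta, global_context)     # per-channel shift
modulated = film(patch_features, gamma, beta)
```

In this sketch the same `gamma` and `beta` are shared by every patch token, so the global context changes how each channel is weighted without altering the spatial layout of the features.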
Related papers
- Towards Pixel-Level VLM Perception via Simple Points Prediction [27.271487302305726]
We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture.
arXiv Detail & Related papers (2026-01-27T05:50:40Z) - SuperCLIP: CLIP with Simple Classification Supervision [88.86549733903314]
Contrastive Language-Image Pretraining achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. Recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text. We propose SuperCLIP, a framework that augments contrastive learning with classification-based supervision.
arXiv Detail & Related papers (2025-12-16T15:11:53Z) - SpatialGeo:Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion [23.86761713752287]
Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks.<n>Most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space.<n>We propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embedding.
arXiv Detail & Related papers (2025-11-21T15:24:33Z) - CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting [53.15827818829865]
Methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies. We propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Our framework explicitly resolves semantic conflicts while preserving category discriminability.
arXiv Detail & Related papers (2025-05-26T19:09:33Z) - Is CLIP ideal? No. Can we fix it? Yes! [30.71718499767702]
Contrastive Language-Image Pre-Training is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. We propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models.
arXiv Detail & Related papers (2025-03-10T23:42:04Z) - Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
arXiv Detail & Related papers (2024-06-17T03:49:47Z) - Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [72.47110803885235]
We introduce a novel framework named Cascade-CLIP for zero-shot semantic segmentation.
Our framework achieves superior zero-shot performance on segmentation benchmarks like COCO-Stuff, Pascal-VOC, and Pascal-Context.
arXiv Detail & Related papers (2024-06-02T08:32:51Z) - CLIP Can Understand Depth [6.877245323116022]
We show that CLIP can be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data. We distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called "mirror". The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets.
arXiv Detail & Related papers (2024-02-05T18:09:33Z) - Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning [14.496173899477283]
We study the problem of Compositional Zero-Shot Learning (CZSL), which is to recognize novel attribute-object combinations with pre-existing concepts.
We propose to insert adapters, a parameter-efficient technique proven to be effective among large language models, into each CLIP encoder layer.
We further equip adapters with concept awareness so that concept-specific features of "object", "attribute", and "composition" can be extracted.
arXiv Detail & Related papers (2023-05-26T07:02:57Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z)