Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
- URL: http://arxiv.org/abs/2509.22615v1
- Date: Fri, 26 Sep 2025 17:41:57 GMT
- Title: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
- Authors: Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe
- Abstract summary: Modern vision-language pipelines are driven by RGB vision encoders trained on massive image-text corpora. These pipelines inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch-based tokenization explodes sequence length. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment.
- Score: 4.2390854432099205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern vision-language pipelines are driven by RGB vision encoders trained on massive image-text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch-based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over 90x faster fitting and about 97% GPU utilization compared to prior implementations. We further adapt contrastive language-image pretraining (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only about 7% of the total parameters. On large DataComp subsets, GS encoders yield meaningful zero-shot ImageNet-1K performance while compressing inputs 3 to 20x relative to pixels. While accuracy currently trails RGB encoders, our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission efficient for edge-cloud learning.
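The paper's fitting pipeline uses batched CUDA kernels, but the core idea of the representation can be illustrated compactly: an image is a normalized blend of colored anisotropic 2D Gaussians, each defined by a center, per-axis scale, rotation, color, and opacity. The following is a minimal NumPy sketch of that rasterization step; the exact parameterization (rotation angle plus diagonal scales) is an assumption based on the abstract, not the paper's actual kernel.

```python
import numpy as np

def render_2dgs(means, scales, thetas, colors, opacities, h, w):
    """Rasterize colored anisotropic 2D Gaussians into an h x w RGB image.

    means:     (N, 2) centers in pixel coordinates (x, y)
    scales:    (N, 2) standard deviations along the principal axes
    thetas:    (N,)   rotation angle of each Gaussian
    colors:    (N, 3) RGB color per Gaussian
    opacities: (N,)   per-Gaussian opacity in [0, 1]
    """
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (h, w, 2)
    image = np.zeros((h, w, 3))
    weight = np.zeros((h, w, 1))
    for mu, s, th, c, a in zip(means, scales, thetas, colors, opacities):
        # Build the anisotropic covariance from rotation and per-axis scale.
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])
        cov = R @ np.diag(s ** 2) @ R.T
        d = grid - mu                                       # (h, w, 2)
        inv = np.linalg.inv(cov)
        # Mahalanobis distance gives the Gaussian falloff at every pixel.
        m = np.einsum('hwi,ij,hwj->hw', d, inv, d)
        g = a * np.exp(-0.5 * m)                            # (h, w)
        image += g[..., None] * c
        weight += g[..., None]
    return image / np.clip(weight, 1e-8, None)              # normalized blend

# One red and one blue splat on a 32x32 canvas.
img = render_2dgs(
    means=np.array([[10.0, 10.0], [22.0, 22.0]]),
    scales=np.array([[4.0, 2.0], [3.0, 3.0]]),
    thetas=np.array([0.5, 0.0]),
    colors=np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
    opacities=np.array([0.9, 0.9]),
    h=32, w=32,
)
```

Because the representation stores N parameter tuples instead of h x w x 3 pixels, transmitting the splats rather than the raster is what yields the 3 to 20x compression the abstract reports.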
Related papers
- Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression [26.855464287699366]
We propose a structure-guided allocation principle for 2DGS, which explicitly couples image structure with both representation capacity and quantization precision. We show that our approach substantially improves both the representational power and the rate-distortion performance of 2DGS while maintaining over 1000 FPS decoding.
arXiv Detail & Related papers (2025-12-30T06:35:46Z) - Contour Information Aware 2D Gaussian Splatting for Image Representation [0.0]
We propose a Contour Information-Aware 2D Gaussian Splatting framework. Our method achieves higher reconstruction quality around object edges compared to existing 2DGS methods.
arXiv Detail & Related papers (2025-12-29T07:24:36Z) - Generative Latent Coding for Ultra-Low Bitrate Image Compression [61.71793017252801]
We introduce a Generative Latent Coding architecture, which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE), instead of in the pixel space. The generative latent space is characterized by greater sparsity, richer semantics, and better alignment with human perception, rendering it advantageous for achieving high-realism and high-fidelity compression.
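The vector-quantization step at the heart of a VQ-VAE can be sketched in a few lines: each continuous latent is snapped to its nearest codebook entry, and only the resulting integer indices need to be entropy-coded and transmitted. The names and shapes below are illustrative, not taken from the paper.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (L2 distance).

    latents:  (M, D) continuous encoder outputs
    codebook: (K, D) learned embedding vectors
    Returns the discrete code indices and the quantized latents; the decoder
    reconstructs from the indices alone, which is what makes latent-space
    transform coding compact at ultra-low bitrates.
    """
    # Pairwise squared distances between latents and codebook entries.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)          # (M,) symbols to entropy-code
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))           # K=8 entries of dimension D=4
# Latents close to entries 2 and 5 should quantize back to those entries.
latents = codebook[[2, 5]] + 0.01 * rng.normal(size=(2, 4))
idx, quantized = vector_quantize(latents, codebook)
```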
arXiv Detail & Related papers (2025-12-23T09:35:40Z) - Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation [37.57424511974552]
We propose GSDD, a novel and efficient sparse representation for dataset distillation based on 2D Gaussians. Instead of representing all pixels equally, GSDD encodes the critical discriminative information in a distilled image using only a small number of Gaussian primitives. Experiments show that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets, while keeping encoding and decoding costs highly efficient.
arXiv Detail & Related papers (2025-09-30T13:19:05Z) - GViT: Representing Images as Gaussians for Visual Recognition [54.46109876668194]
We introduce GViT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. We demonstrate that 2D Gaussian input representations coupled with our GViT guidance, using a relatively standard ViT architecture, closely match the performance of a traditional patch-based ViT.
arXiv Detail & Related papers (2025-06-30T05:44:14Z) - HDBFormer: Efficient RGB-D Semantic Segmentation with A Heterogeneous Dual-Branch Framework [0.0]
In RGB-D semantic segmentation for indoor scenes, a key challenge is effectively integrating the rich color information from RGB images with the spatial distance information from depth images. We propose a novel heterogeneous dual-branch framework called HDBFormer, specifically designed to handle these modality differences. For RGB images, which contain rich detail, we employ both a basic and a detail encoder to extract local and global features. For the simpler depth images, we propose LDFormer, a lightweight hierarchical encoder that efficiently extracts depth features with fewer parameters.
arXiv Detail & Related papers (2025-04-18T09:29:46Z) - GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting [64.84383010238908]
We propose an effective image tokenizer with 2D Gaussian Splatting as a solution. In general, our framework integrates the local influence of the 2D Gaussian distribution into the discrete space. Competitive reconstruction performance on CIFAR, Mini-ImageNet, and ImageNet-1K demonstrates the effectiveness of our framework.
arXiv Detail & Related papers (2025-01-26T17:56:11Z) - SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images [125.66499135980344]
We propose SparseGrasp, a novel open-vocabulary robotic grasping system. SparseGrasp operates efficiently with sparse-view RGB images and handles scene updates quickly. We show that SparseGrasp significantly outperforms state-of-the-art methods in terms of both speed and adaptability.
arXiv Detail & Related papers (2024-12-03T03:56:01Z) - Multispectral Texture Synthesis using RGB Convolutional Neural Networks [2.3213238782019316]
State-of-the-art RGB texture synthesis algorithms rely on style distances that are computed through statistics of deep features.
We propose two solutions to extend these methods to multispectral imaging.
arXiv Detail & Related papers (2024-10-21T13:49:54Z) - Image-GS: Content-Adaptive Image Representation via 2D Gaussians [52.598772767324036]
We introduce Image-GS, a content-adaptive image representation based on 2D Gaussians. It supports hardware-friendly rapid access for real-time usage, requiring only 0.3K MACs to decode a pixel. We demonstrate its versatility with several applications, including texture compression, semantics-aware compression, and joint image compression and restoration.
arXiv Detail & Related papers (2024-07-02T00:45:21Z) - Robust Double-Encoder Network for RGB-D Panoptic Segmentation [31.807572107839576]
Panoptic segmentation provides an interpretation of the scene by computing a pixelwise semantic label together with instance IDs.
We propose a novel encoder-decoder neural network that processes RGB and depth separately through two encoders.
We show that our approach achieves superior results compared to other common approaches for panoptic segmentation.
arXiv Detail & Related papers (2022-10-06T11:46:37Z) - Parallel Discrete Convolutions on Adaptive Particle Representations of
Images [2.362412515574206]
We present data structures and algorithms for native implementations of discrete convolution operators over Adaptive Particle Representations.
The APR is a content-adaptive image representation that locally adapts the sampling resolution to the image signal.
We show that APR convolution naturally leads to scale-adaptive algorithms that efficiently parallelize on multi-core CPU and GPU architectures.
arXiv Detail & Related papers (2021-12-07T09:40:05Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as a cross-modal feature fusion.
In this paper, we propose a unified and efficient cross-modality guided encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.