TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge
- URL: http://arxiv.org/abs/2510.21879v1
- Date: Thu, 23 Oct 2025 14:53:32 GMT
- Title: TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge
- Authors: Shu-Hao Zhang, Wei-Cheng Tang, Chen Wu, Peng Hu, Nan Li, Liang-Jie Zhang, Qi Zhang, Shao-Qun Zhang
- Abstract summary: TernaryCLIP is a lightweight framework that converts connection weights of both vision and text encoders of CLIP into the ternary format. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices.
- Score: 23.707347449137895
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent years have witnessed increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose TernaryCLIP, a lightweight computational framework that converts the connection weights of both the vision and text encoders of CLIP into ternary format, instead of full-precision or floating-point formats. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost, high-efficiency computation. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with 1.58-bit representation, a 16.98× compression ratio, 2.3× inference acceleration, 16× storage reduction, 10× memory optimization, and 60% sparsity, while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
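The abstract does not spell out the exact ternarization rule, but the 1.58-bit figure matches the BitNet-b1.58-style absmean scheme. Below is a minimal sketch of such ternary quantization-aware training with a straight-through estimator; the names (`ternarize`, `TernaryLinear`) and the absmean scaling are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def ternarize(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Absmean scaling (as in BitNet b1.58); TernaryCLIP's exact rule may differ.
    scale = w.abs().mean().clamp(min=eps)
    # Snap each weight to {-1, 0, +1}, then restore the scale for the forward pass.
    return (w / scale).round().clamp(-1, 1) * scale


class TernaryLinear(nn.Linear):
    """Linear layer whose weights are ternarized on the fly (illustrative)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = ternarize(self.weight)
        # Straight-through estimator: forward with the ternary weights, but
        # backpropagate into the latent full-precision weights as if the
        # quantizer were the identity.
        w_ste = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w_ste, self.bias)
```

Replacing the `nn.Linear` layers of both CLIP encoders with such a module, and adding a distillation loss against the full-precision CLIP teacher, would be one plausible way to realize the quantization-aware training and distillation modules the abstract describes.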
Related papers
- DiP: Taming Diffusion Models in Pixel Space [91.51011771517683]
The Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction. A co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details.
arXiv Detail & Related papers (2025-11-24T06:55:49Z)
- Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment [1.7188280334580195]
We optimize a Vision-Language Model (VLM) for hyperspectral scene understanding by exploiting a CLIP-style contrastive training framework. Our framework maps voxel-level embeddings from a vision backbone onto the latent space of a frozen large embedding model. The proposed method updates only 0.07 percent of the total parameters, yet yields state-of-the-art performance.
arXiv Detail & Related papers (2025-09-20T23:23:04Z)
- Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation [4.063715077687089]
Distill CLIP (DCLIP) is a fine-tuned variant of the CLIP model. It enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities.
arXiv Detail & Related papers (2025-05-25T07:08:07Z)
- AmorLIP: Efficient Language-Image Pretraining via Amortization [52.533088120633785]
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. We propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks.
arXiv Detail & Related papers (2025-05-25T05:30:37Z)
- FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources [45.40926501138365]
We introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques.
Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead.
We benchmark the performance of FastCLIP and the state-of-the-art training baseline on different compute scales.
arXiv Detail & Related papers (2024-07-01T16:37:18Z)
- ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations [67.25974711647481]
Text-to-image (T2I) diffusion models, notably the unCLIP models, achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks.
We introduce ECLIPSE, a novel contrastive learning method that is both parameter and data-efficient.
We demonstrate that the ECLIPSE-trained prior, with only 3.3% of the parameters and trained on a mere 2.8% of the data, surpasses baseline T2I priors with an average preference score of 71.6%.
arXiv Detail & Related papers (2023-12-07T19:32:39Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with the removal of 30% of vision tokens across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy [20.495259430916814]
We present CLIPA-v2, which builds on an inverse scaling law for CLIP training.
We extend the experiments up to the H/14 model with 13B image-text pairs.
Our model achieves an impressive zero-shot ImageNet accuracy of 81.1%.
arXiv Detail & Related papers (2023-06-27T17:51:06Z)
- Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective [27.650434284271363]
Under 50 IPC, our approach achieves the highest validation accuracy of 42.5% and 60.8% on the Tiny-ImageNet and ImageNet-1K datasets, respectively.
Our approach is also approximately 52× (ConvNet-4) and 16× (ResNet-18) faster than MTT, with 11.6× and 6.4× less memory consumption during data synthesis.
arXiv Detail & Related papers (2023-06-22T17:59:58Z)
- Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory [66.035487142452]
We show that trajectory-matching-based methods (MTT) can scale to large-scale datasets such as ImageNet-1K.
We propose a procedure to exactly compute the unrolled gradient with constant memory complexity, which allows us to scale MTT to ImageNet-1K seamlessly with 6x reduction in memory footprint.
The resulting algorithm sets a new SOTA on ImageNet-1K: we can scale up to 50 IPC (Images Per Class) on ImageNet-1K on a single GPU.
arXiv Detail & Related papers (2022-11-19T04:46:03Z)
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data-efficient CLIP (DeCLIP), to alleviate the data-hungry nature of CLIP pre-training.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
- Highly Efficient Salient Object Detection with 100K Parameters [137.74898755102387]
We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stage multi-scale features.
We build an extremely lightweight model, namely CSNet, which achieves performance comparable to large models on popular salient object detection benchmarks with only about 0.2% (100k) of their parameters.
arXiv Detail & Related papers (2020-03-12T07:00:46Z)
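For reference, TernaryCLIP and several of the related entries above (AmorLIP, FastCLIP, DeCLIP) all build on CLIP's symmetric image-text contrastive objective. The sketch below states that standard InfoNCE-style loss in its common formulation, not as specified by any one of the listed papers.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) outputs of the two encoders; row i of
    each tensor forms a matched pair, every other row serves as a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

In a ternarized CLIP, this objective would presumably remain unchanged; only the weight representation inside the encoders differs.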