Related papers: D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models

URL: http://arxiv.org/abs/2511.15411v1
Date: Wed, 19 Nov 2025 13:08:25 GMT
Title: D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models
Authors: Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka
Abstract summary: We propose D4C, the first Data-Free Quantization (DFQ) framework tailored for Vision-Language Models (CLIP). D4C synthesizes semantically rich and structurally diverse pseudo images through three key components. Experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models.
Score: 10.318833207091162
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
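
For readers unfamiliar with the notation, W4A8 denotes 4-bit weights and 8-bit activations. The block below is a minimal sketch of that setting using generic symmetric uniform fake quantization in pure Python. It is not D4C's quantizer or its image-synthesis pipeline; the function names `fake_quantize` and `linear_w4a8` and the per-row weight scaling are assumptions made purely for illustration.

```python
def fake_quantize(values, num_bits):
    """Symmetric uniform fake quantization (illustrative, not from the paper).

    Values are snapped to a signed integer grid with num_bits bits and
    mapped back to floats, so the result carries the rounding error.
    """
    qmax = 2 ** (num_bits - 1) - 1              # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)                     # nothing to scale
    scale = max_abs / qmax                      # float step between grid points
    quantized = []
    for v in values:
        q = round(v / scale)                    # nearest integer level
        q = max(-qmax, min(qmax, q))            # clamp to the representable range
        quantized.append(q * scale)             # dequantize back to float
    return quantized


def linear_w4a8(weights, activations):
    """Bias-free linear layer with 4-bit weights and 8-bit activations."""
    # Assumption: one scale per weight row (per output channel).
    w_q = [fake_quantize(row, 4) for row in weights]
    a_q = fake_quantize(activations, 8)
    return [sum(w * a for w, a in zip(row, a_q)) for row in w_q]


if __name__ == "__main__":
    weights = [[0.5, -1.0, 0.25], [0.1, 0.2, -0.3]]
    activations = [1.0, 2.0, -0.5]
    print(linear_w4a8(weights, activations))
```

In a data-free setting, the quantization ranges would be calibrated on synthesized images rather than real ones; here the ranges are taken directly from the toy inputs.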

Related papers

Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach [0.7446442872036001]
Role-SynthCLIP is a novel data synthesis framework that leverages multi-perspective role-playing prompts. It enhances semantic diversity and fine-grained image-text alignment of synthetic pairs. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set.
arXiv Detail & Related papers (2025-11-07T08:03:53Z)
Deeply-Conditioned Image Compression via Self-Generated Priors [75.29511865838812]
We introduce a framework predicated on functional decomposition, which we term Deeply-Conditioned Image Compression via self-generated priors (DCIC-sgp). Our framework achieves significant BD-rate reductions of 14.4%, 15.7%, and 15.1% against the VVC test model VTM-12.1 on the Kodak, CLIC, and Tecnick datasets.
arXiv Detail & Related papers (2025-10-28T14:04:19Z)
Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification [31.116511358786084]
Text-to-image (T2I) models are increasingly used for synthetic dataset generation. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification.
arXiv Detail & Related papers (2025-10-28T05:40:14Z)
DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images [14.448350657613368]
DeeCLIP is a novel framework for detecting AI-generated images. It incorporates DeeFuser, a fusion module that combines high-level and low-level features. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.90%.
arXiv Detail & Related papers (2025-04-28T15:06:28Z)
Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models [21.46605047406198]
Diffusion-4K is a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. We construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We propose a wavelet-based fine-tuning approach for direct training with 4K images, applicable to various latent diffusion models.
arXiv Detail & Related papers (2025-03-24T05:25:07Z)
CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification [65.46685389276443]
We ground our work on CLIP, a vision-language pre-trained encoder model that can perform zero-shot classification by matching an image with text prompts. We then formulate purification risk as the KL divergence between the joint distributions of the purification process and the attack process. We propose two variants for our CLIPure approach: CLIPure-Diff, which models the likelihood of images' latent vectors, and CLIPure-Cos, which models the likelihood with the cosine similarity between the embeddings of an image and the text "a photo of a."
arXiv Detail & Related papers (2025-02-25T13:09:34Z)
Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers [58.80845404416028]
Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. We propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs.
arXiv Detail & Related papers (2024-12-21T09:30:45Z)
$\texttt{BATCLIP}$: Bimodal Online Test-Time Adaptation for CLIP [18.278043899825267]
Open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities. We show that zero-shot CLIP lacks robustness to common image corruptions during test-time. We propose $\texttt{BATCLIP}$, a bimodal $\textbf{online}$ TTA method designed to improve CLIP's robustness to common image corruptions.
arXiv Detail & Related papers (2024-12-03T21:02:14Z)
Comb, Prune, Distill: Towards Unified Pruning for Vision Model Compression [24.119415458653616]
We propose a novel unified pruning framework Comb, Prune, Distill (CPD) to address both model-agnostic and task-agnostic concerns simultaneously. Our framework employs a combing step to resolve hierarchical layer-wise dependency issues, enabling architecture independence. In image classification we achieve a speedup of up to x4.3 with an accuracy loss of 1.8% and in semantic segmentation up to x1.89 with a 5.1% loss in mIoU.
arXiv Detail & Related papers (2024-08-06T09:02:31Z)
Probabilistic-based Feature Embedding of 4-D Light Fields for Compressive Imaging and Denoising [62.347491141163225]
The 4-D light field (LF) poses great challenges in achieving efficient and effective feature embedding. We propose a probabilistic-based feature embedding (PFE), which learns a feature embedding architecture by assembling various low-dimensional convolution patterns. Our experiments demonstrate the significant superiority of our methods on both real-world and synthetic 4-D LF images.
arXiv Detail & Related papers (2023-06-15T03:46:40Z)
UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks [60.46473247205654]
Using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models. Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models.
arXiv Detail & Related papers (2023-06-07T18:26:22Z)
Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks. We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.