NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization
- URL: http://arxiv.org/abs/2511.08417v1
- Date: Wed, 12 Nov 2025 01:58:08 GMT
- Title: NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization
- Authors: Xiyuan Wei, Chih-Jen Lin, Tianbao Yang
- Abstract summary: Accurately estimating the normalization term in the contrastive loss is a central challenge for Contrastive Language-Image Pre-training models. We propose NeuCLIP, a novel and elegant optimization framework based on two key ideas. Experiments on large-scale CLIP training, spanning datasets from millions to billions of samples, demonstrate that NeuCLIP outperforms previous methods.
- Score: 42.298647858844895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately estimating the normalization term (also known as the partition function) in the contrastive loss is a central challenge for training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods rely on large batches for approximation, demanding substantial computational resources. To mitigate this issue, prior works introduced per-sample normalizer estimators, which are updated at each epoch in a blockwise coordinate manner to keep track of the updated encoders. However, this scheme incurs an optimization error that scales with the ratio of dataset size to batch size, limiting its effectiveness for large datasets or small batches. To overcome this limitation, we propose NeuCLIP, a novel and elegant optimization framework based on two key ideas: (i) $\textbf{reformulating}$ the contrastive loss for each sample $\textbf{via convex analysis}$ into a minimization problem with an auxiliary variable representing its log-normalizer; and (ii) $\textbf{transforming}$ the resulting minimization over $n$ auxiliary variables (where $n$ is the dataset size) via $\textbf{variational analysis}$ into a minimization over a compact neural network that predicts the log-normalizers. We design an alternating optimization algorithm that jointly trains the CLIP model and the auxiliary network. By employing a tailored architecture and acceleration techniques for the auxiliary network, NeuCLIP achieves more accurate normalizer estimation, leading to improved performance compared with previous methods. Extensive experiments on large-scale CLIP training, spanning datasets from millions to billions of samples, demonstrate that NeuCLIP outperforms previous methods.
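The two ideas admit a concrete reading. Idea (i) is consistent with the standard variational identity for the log-partition function (our reconstruction from the abstract; the paper's exact reformulation may differ):

```latex
\log \sum_{j} e^{s_{ij}}
  \;=\; \min_{u_i \in \mathbb{R}} \Big\{\, u_i + e^{-u_i} \sum_{j} e^{s_{ij}} - 1 \,\Big\},
\qquad
u_i^{\star} \;=\; \log \sum_{j} e^{s_{ij}},
```

so minimizing over the auxiliary variable $u_i$ recovers the exact log-normalizer. Idea (ii) then replaces the $n$ free variables with a compact predictor $u_i \approx g_\phi(x_i)$. Below is a minimal, self-contained sketch of the resulting alternating scheme; the toy linear encoders, `aux_net`, and all hyperparameters are our own illustrative assumptions, not the authors' implementation:

```python
# Hedged sketch of a NeuCLIP-style alternating scheme (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, D, tau = 8, 32, 0.07

img_enc = nn.Linear(64, D)   # stand-in for the CLIP image encoder
txt_enc = nn.Linear(64, D)   # stand-in for the CLIP text encoder
aux_net = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 1))

opt_clip = torch.optim.AdamW(
    list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-3)
opt_aux = torch.optim.AdamW(aux_net.parameters(), lr=1e-3)

def objective(img, txt):
    zi = F.normalize(img_enc(img), dim=-1)
    zt = F.normalize(txt_enc(txt), dim=-1)
    logits = zi @ zt.t() / tau              # (B, B) pairwise similarities
    u = aux_net(zi).squeeze(-1)             # predicted log-normalizers u_i
    Z = torch.exp(logits).sum(dim=1)        # minibatch estimate of the normalizer
    # Since log Z = min_u (u + e^{-u} Z - 1), this upper-bounds -s_ii + log Z
    return (-logits.diag() + u + torch.exp(-u) * Z - 1).mean()

for step in range(100):
    img, txt = torch.randn(B, 64), torch.randn(B, 64)
    # (i) update the encoders given current normalizer predictions
    opt_clip.zero_grad(); objective(img, txt).backward(); opt_clip.step()
    # (ii) update the auxiliary network to tighten the variational bound
    opt_aux.zero_grad(); objective(img, txt).backward(); opt_aux.step()
```

The alternation mirrors the abstract's joint training of the CLIP model and the auxiliary network; the full method's tailored architecture and acceleration techniques are not reproduced here.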
Related papers
- Trajectory Consistency for One-Step Generation on Euler Mean Flows [24.038760671907024]
We propose Euler Mean Flows (EMF), a flow-based generative framework for one-step and few-step generation. EMF enforces long-range trajectory consistency with minimal sampling cost.
arXiv Detail & Related papers (2026-01-31T04:32:32Z) - Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP [60.025820738301434]
TuneCLIP is a self-supervised fine-tuning framework for CLIP models. It consistently improves performance across model architectures and scales. It elevates leading open-weight models like SigLIP (ViT-B/16), achieving gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks.
arXiv Detail & Related papers (2026-01-14T20:38:36Z) - Closing the Approximation Gap of Partial AUC Optimization: A Tale of Two Formulations [121.39938773554523]
The Area Under the ROC Curve (AUC) is a pivotal evaluation metric in real-world scenarios with both class imbalance and decision constraints. We present two simple instance-wise minimax reformulations to close the approximation gap of partial AUC (PAUC) optimization. The resulting algorithms enjoy a linear per-iteration computational complexity w.r.t. the sample size and a convergence rate of $O(\epsilon^{-2/3})$ for typical one-way and two-way PAUCs.
arXiv Detail & Related papers (2025-12-01T02:52:33Z) - Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe [61.68406997155879]
State-of-the-art Large Language Model (LLM) pruning methods operate layer-wise, minimizing the per-layer pruning error on a small dataset to avoid full retraining. Existing methods hence rely on greedy heuristics that ignore the weight interactions in the pruning objective. Our method drastically reduces the per-layer pruning error, outperforms strong baselines on state-of-the-art GPT architectures, and remains memory-efficient.
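For orientation, a hedged sketch of the layer-wise objective such methods minimize, with a magnitude-pruning baseline for contrast; both functions are our illustration, not the paper's Frank-Wolfe algorithm:

```python
import numpy as np

def layer_pruning_error(W, W_hat, X):
    """Per-layer reconstruction error ||W X - W_hat X||_F^2 on calibration data X."""
    return np.linalg.norm(W @ X - W_hat @ X, ord="fro") ** 2

def magnitude_prune(W, sparsity=0.5):
    """Greedy baseline: zero out the smallest-magnitude entries of W."""
    k = int(sparsity * W.size)
    thresh = np.sort(np.abs(W).ravel())[k]       # magnitude cutoff
    return np.where(np.abs(W) >= thresh, W, 0.0)
```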
arXiv Detail & Related papers (2025-10-15T16:13:44Z) - Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning [71.30276778807068]
We propose Q-Tuning, a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning achieves a +38% average improvement over the full-data SFT baseline while using only 12.5% of the original training data.
arXiv Detail & Related papers (2025-09-28T13:27:38Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
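As a point of reference, a hedged sketch of what an asymmetric ternary quantizer can look like, mapping weights to $\{-a, 0, +b\}$ with separate negative/positive scales; the paper's exact quantizer and its two-stage refinement pipeline are not specified in this summary:

```python
import numpy as np

def ternarize_asymmetric(w, delta_frac=0.05):
    """Map weights to {-a, 0, +b}; scales are fit per sign (illustrative)."""
    delta = delta_frac * np.abs(w).max()        # dead-zone threshold around zero
    pos, neg = w > delta, w < -delta
    b = w[pos].mean() if pos.any() else 0.0     # scale for positive weights
    a = -w[neg].mean() if neg.any() else 0.0    # scale for negative weights
    q = np.zeros_like(w)
    q[pos], q[neg] = b, -a
    return q
```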
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - To Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA [26.87533852488578]
The binary exponential backoff scheme is widely used in WiFi 7, yet it still incurs poor throughput under dynamic channel environments. Recent model-based approaches simply optimize backoff strategies under a known, fixed node density. This paper is the first to propose a transformer-based in-context learning (ICL) theory for optimizing channel access.
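For context, the binary exponential backoff baseline the paper critiques can be sketched as follows; the CW_MIN/CW_MAX values follow common 802.11 conventions and are our assumption, not the paper's:

```python
import random

CW_MIN, CW_MAX = 15, 1023

def next_backoff(collisions: int) -> int:
    """Contention window doubles after each collision, capped at CW_MAX."""
    cw = min(CW_MAX, (CW_MIN + 1) * 2 ** collisions - 1)
    return random.randint(0, cw)
```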
arXiv Detail & Related papers (2025-07-31T23:31:23Z) - Deep-ICE: The first globally optimal algorithm for empirical risk minimization of two-layer maxout and ReLU networks [1.7266553199919665]
This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks. The proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a novel coreset selection method that reduces the data size to a manageable scale.
arXiv Detail & Related papers (2025-05-09T02:34:54Z) - SAPPHIRE: Preconditioned Stochastic Variance Reduction for Faster Large-Scale Statistical Learning [18.055120576191204]
Ill-conditioned objectives and nonsmooth regularizers undermine the performance of traditional convex methods. We propose a preconditioned, variance-reduced solution for ill-conditioned, composite, large-scale machine learning problems.
arXiv Detail & Related papers (2025-01-27T10:36:45Z) - Combinatorial optimization for low bit-width neural networks [23.466606660363016]
Low-bit width neural networks have been extensively explored for deployment on edge devices to reduce computational resources.
Existing approaches have focused on gradient-based optimization in a two-stage train-and-compress setting.
We show that a combination of greedy coordinate descent and a novel combinatorial optimization approach can attain competitive accuracy on binary classification tasks.
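A hedged sketch of greedy coordinate descent over binary weights on a squared loss, illustrating the general idea rather than the paper's exact procedure:

```python
import numpy as np

def greedy_coordinate_descent(X, y, w, sweeps=10):
    """w in {-1, +1}^d; commit the single sign flip that lowers the loss most."""
    for _ in range(sweeps):
        best_j, best = None, np.mean((X @ w - y) ** 2)
        for j in range(len(w)):
            w[j] = -w[j]                        # trial flip
            loss = np.mean((X @ w - y) ** 2)
            if loss < best:
                best_j, best = j, loss
            w[j] = -w[j]                        # undo trial
        if best_j is None:
            break                               # no improving flip: local optimum
        w[best_j] = -w[best_j]                  # commit the best flip
    return w
```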
arXiv Detail & Related papers (2022-06-04T15:02:36Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation-based variants can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale problems with a deep neural network. In theory, our method requires far fewer communication rounds than existing algorithms. Experiments on several datasets demonstrate its effectiveness and corroborate the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)