Understanding the Gain from Data Filtering in Multimodal Contrastive Learning
- URL: http://arxiv.org/abs/2512.14230v1
- Date: Tue, 16 Dec 2025 09:28:38 GMT
- Title: Understanding the Gain from Data Filtering in Multimodal Contrastive Learning
- Authors: Divyansh Pareek, Sewoong Oh, Simon S. Du
- Abstract summary: We characterize the performance of filtered contrastive learning under the standard bimodal data generation model. We show that the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
- Score: 58.09454096371477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $\eta \in (0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\frac{1}{\eta\sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
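As a quick sanity check on these rates, the sketch below compares the three bounds numerically for a few values of $\eta$ (the hidden constants are set to 1 and the sample size is hypothetical); in the large-$\eta$ regime, filtering improves the bound by a factor of $1/\sqrt{\eta}$.

```python
import math

# Illustrative comparison of the paper's error rates (constants set to 1).
# Without filtering:       error ~ 1 / (eta * sqrt(n))
# With teacher filtering:  error ~ 1 / sqrt(eta * n)  (large-eta regime)
#                          error ~ 1 / sqrt(n)        (small-eta regime)
n = 1_000_000  # number of paired samples (hypothetical)
for eta in (0.9, 0.5, 0.1):
    unfiltered = 1 / (eta * math.sqrt(n))
    filtered_large_eta = 1 / math.sqrt(eta * n)
    filtered_small_eta = 1 / math.sqrt(n)
    # unfiltered / filtered_large_eta = 1 / sqrt(eta), i.e. filtering
    # buys a 1/sqrt(eta) improvement in the large-eta regime.
    print(f"eta={eta}: unfiltered={unfiltered:.2e}, "
          f"filtered (large eta)={filtered_large_eta:.2e}, "
          f"filtered (small eta)={filtered_small_eta:.2e}")
```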
Related papers
- Stabilizing Open-Set Test-Time Adaptation via Primary-Auxiliary Filtering and Knowledge-Integrated Prediction [22.1791117523576]
Test-Time Adaptation (TTA) addresses distribution shift by adapting the model to test data during inference. We propose Primary-Auxiliary Filtering (PAF), which employs an auxiliary filter to validate data filtered by the primary filter. We validate our approach across diverse closed-set and open-set datasets.
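The dual-filter idea lends itself to a short sketch: a primary filter selects samples, and an auxiliary filter double-checks them. Everything below (confidence thresholding as the primary criterion, neighbor agreement as the auxiliary one) is a hypothetical stand-in, not the paper's actual filters.

```python
import numpy as np

def primary_filter(logits, tau=0.8):
    """Hypothetical primary filter: keep samples the model is confident on."""
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs.max(axis=1) > tau

def auxiliary_filter(features, kept, k=5):
    """Hypothetical auxiliary filter: validate a kept sample only if most of
    its nearest neighbors (by cosine similarity) were also kept."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = feats @ feats.T
    np.fill_diagonal(sims, -np.inf)           # exclude self-matches
    nn = np.argsort(-sims, axis=1)[:, :k]     # k nearest neighbors per sample
    agree = kept[nn].mean(axis=1)             # fraction of kept neighbors
    return kept & (agree > 0.5)

logits = np.random.randn(100, 10)             # placeholder model outputs
features = np.random.randn(100, 32)           # placeholder embeddings
final_mask = auxiliary_filter(features, primary_filter(logits))
```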
arXiv Detail & Related papers (2025-08-26T07:27:25Z)
- Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [72.69498649272347]
Learning conditional distributions is a central problem in machine learning. We propose a new paradigm that integrates both paired and unpaired data. We show that our approach can theoretically recover true conditional distributions with arbitrarily small error.
arXiv Detail & Related papers (2024-10-03T16:12:59Z)
- Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond [28.651041302245538]
We present a new data selection approach based on $k$-means clustering and sensitivity sampling.
We show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling.
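The general recipe is easy to sketch: cluster the data with $k$-means, derive a per-point sensitivity proxy, and sample proportionally to it. The proxy below (distance to the assigned center plus a small uniform term) and all parameters are illustrative, not the paper's exact estimator.

```python
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(X, k=10, m=100, seed=0):
    """Sample m points with probability proportional to a crude sensitivity
    proxy: distance to the assigned k-means center."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    s = dist + dist.mean() / len(X)   # uniform term keeps every point sampleable
    p = s / s.sum()
    idx = rng.choice(len(X), size=m, replace=False, p=p)
    return idx, 1.0 / (p[idx] * m)    # indices and importance weights

X = np.random.randn(1000, 16)         # placeholder dataset
idx, weights = sensitivity_sample(X)
```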
arXiv Detail & Related papers (2024-02-27T09:03:43Z)
- Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
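The second stage reduces to a greedy selection problem: repeatedly pick the highest-difficulty sample while penalizing candidates similar to those already chosen. A minimal sketch follows; the penalty weight and the cosine-similarity penalty are illustrative placeholders rather than Self-Filter's exact formulation.

```python
import numpy as np

def select_hard_diverse(scores, features, m=50, lam=0.5):
    """Greedily pick the m highest-difficulty samples, penalizing each
    candidate by its max cosine similarity to already-selected samples."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    adjusted = scores.astype(float)            # running penalized difficulty
    selected = []
    for _ in range(m):
        i = int(np.argmax(adjusted))
        selected.append(i)
        adjusted[i] = -np.inf                  # never pick the same sample twice
        sim_to_i = feats @ feats[i]            # cosine similarity to the new pick
        adjusted = np.minimum(adjusted, scores - lam * sim_to_i)
    return selected

scores = np.random.rand(500)           # difficulty from the trained score net
features = np.random.randn(500, 64)    # placeholder per-sample embeddings
picked = select_hard_diverse(scores, features)
```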
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
- Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for the second step of a data curation pipeline: filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
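The mechanism a DFN shares with the teacher-based filtering analyzed in the main paper is simple: score every image-text pair with a trained model and keep the top-scoring fraction. A minimal sketch over precomputed embeddings (the encoders and the 30% keep-rate are hypothetical choices):

```python
import numpy as np

def filter_pairs(img_emb, txt_emb, keep_frac=0.3):
    """Score each (image, text) pair by cosine similarity under the filtering
    model and keep the top keep_frac fraction."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    scores = (img * txt).sum(axis=1)          # per-pair cosine similarity
    thresh = np.quantile(scores, 1 - keep_frac)
    return np.nonzero(scores >= thresh)[0]

# In practice the embeddings would come from the filtering network's encoders.
img_emb = np.random.randn(10_000, 512)
txt_emb = np.random.randn(10_000, 512)
kept_indices = filter_pairs(img_emb, txt_emb)
```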
arXiv Detail & Related papers (2023-09-29T17:37:29Z)
- Filter Pruning for Efficient CNNs via Knowledge-driven Differential Filter Sampler [103.97487121678276]
Filter pruning simultaneously accelerates the computation and reduces the memory overhead of CNNs.
We propose a novel Knowledge-driven Differential Filter Sampler (KDFS) with a Masked Filter Modeling (MFM) framework for filter pruning.
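For context on what filter pruning itself does, here is a minimal magnitude-based baseline (not KDFS): rank a convolution's filters by weight norm and keep only the strongest, shrinking both compute and memory. The keep fraction is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_frac=0.5) -> nn.Conv2d:
    """Baseline magnitude pruning: rank filters by the L1 norm of their
    weights and keep the top keep_frac fraction (simpler than KDFS)."""
    w = conv.weight.data                       # shape (out, in, kh, kw)
    norms = w.abs().sum(dim=(1, 2, 3))         # L1 norm per output filter
    n_keep = max(1, int(keep_frac * w.size(0)))
    keep = torch.topk(norms, n_keep).indices.sort().values
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = w[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

layer = nn.Conv2d(64, 128, 3, padding=1)
smaller = prune_conv_filters(layer)            # 128 -> 64 output filters
```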
arXiv Detail & Related papers (2023-07-01T02:28:41Z)
- Blessing of Class Diversity in Pre-training [54.335530406959435]
We prove that when the classes of the pre-training task are sufficiently diverse, pre-training can significantly improve the sample efficiency of downstream tasks.
Our proof relies on a vector-form Rademacher complexity chain rule for composite function classes and a modified self-concordance condition.
arXiv Detail & Related papers (2022-09-07T20:10:12Z)
- Graph filtering over expanding graphs [14.84852576248587]
We propose a filter learning scheme for data over expanding graphs.
We show near-optimal performance compared with baselines relying on the exact topology.
These findings lay the foundation for learning representations over expanding graphs by relying only on the connectivity model.
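A standard building block behind such schemes is the polynomial graph filter $y = \sum_k h_k S^k x$, which touches the topology only through a shift operator $S$; this is what makes it natural to substitute a connectivity model when the exact expanding topology is unknown. A minimal sketch with fixed (rather than learned) coefficients:

```python
import numpy as np

def graph_filter(S, x, h):
    """Apply the polynomial graph filter y = sum_k h[k] * S^k @ x, where S is
    a graph shift operator (e.g., adjacency matrix or Laplacian)."""
    y = np.zeros_like(x, dtype=float)
    Skx = x.astype(float)                # running S^k @ x, starting at k = 0
    for hk in h:
        y += hk * Skx
        Skx = S @ Skx                    # advance to the next power of S
    return y

# Toy 4-node path graph; in the expanding-graph setting, new nodes and edges
# would be drawn from a connectivity model rather than observed exactly.
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])
y = graph_filter(S, x, h=[0.5, 0.3, 0.2])
```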
arXiv Detail & Related papers (2022-03-15T16:50:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.