Concept-Aware Batch Sampling Improves Language-Image Pretraining
- URL: http://arxiv.org/abs/2511.20643v1
- Date: Tue, 25 Nov 2025 18:58:07 GMT
- Title: Concept-Aware Batch Sampling Improves Language-Image Pretraining
- Authors: Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge
- Abstract summary: Concept-Aware Batch Sampling (CABS) is a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly. We show that CABS significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms.
- Score: 78.53540190580189
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
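The two batch-construction variants described in the abstract can be illustrated with a minimal, hypothetical sketch. The function name `sample_batch`, the greedy selection rule, and the candidate-pool heuristic are all assumptions made for illustration; the paper's actual CABS-DM/CABS-FM implementations may differ.

```python
import random
from collections import Counter

def sample_batch(pool, batch_size, mode="diversity", candidate_factor=4):
    """Greedy concept-aware batch sampling (illustrative sketch only).

    pool: list of (sample_id, concepts) pairs, where concepts is a list
          of concept labels annotated for that image-text pair.
    mode: "diversity" favors broad concept coverage (cf. CABS-DM);
          "frequency" favors samples with many concepts (cf. CABS-FM).
    """
    # Score a random candidate subset rather than the full dataset,
    # mimicking on-the-fly (online) batch construction.
    candidates = random.sample(pool, min(len(pool), batch_size * candidate_factor))
    batch, covered = [], Counter()
    for _ in range(batch_size):
        if mode == "diversity":
            # Pick the candidate adding the most not-yet-covered concepts.
            best = max(candidates, key=lambda s: sum(1 for c in s[1] if covered[c] == 0))
        else:
            # Pick the candidate with the highest concept multiplicity.
            best = max(candidates, key=lambda s: len(s[1]))
        candidates.remove(best)
        batch.append(best)
        covered.update(best[1])
    return batch
```

A custom target concept distribution, as the abstract suggests, could replace either scoring rule with a weighted match against the practitioner's desired distribution.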
Related papers
- Unified Multi-Dataset Training for TBPS [7.745213180689951]
Existing TBPS methods rely on dataset-centric fine-tuning to handle distribution shift. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities. We propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework.
arXiv Detail & Related papers (2026-01-21T13:26:28Z) - Online-PVLM: Advancing Personalized VLMs with Online Concept Learning [19.46716778297505]
Online-PVLM is a framework for online concept learning that leverages hyperbolic representations. We develop OP-Eval, a benchmark comprising 1,292 concepts and over 30K high-quality instances with diverse question types.
arXiv Detail & Related papers (2025-11-25T08:25:30Z) - Interpretable Reward Modeling with Active Concept Bottlenecks [54.00085739303773]
We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. We formalize an active learning strategy that dynamically acquires the most informative concept labels.
arXiv Detail & Related papers (2025-07-07T06:26:04Z) - Offline Learning for Combinatorial Multi-armed Bandits [56.96242764723241]
Off-CMAB is the first offline learning framework for CMAB. Off-CMAB combines pessimistic reward estimations with solvers. Experiments on synthetic and real-world datasets highlight the superior performance of CLCB.
arXiv Detail & Related papers (2025-01-31T16:56:18Z) - Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. In this work we explore an alternative, yet simple approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data-, and compute-configurations.
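The snippet above mentions online batch selection but not the scoring rule. A generic, hypothetical sketch of loss-based online selection follows; the function name `select_batch` and the learner-minus-reference "learnability" score are assumptions for illustration, not necessarily the exact ACID criterion from the paper.

```python
import numpy as np

def select_batch(learner_losses, reference_losses, batch_size):
    """Generic loss-based online batch selection (illustrative only).

    Scores each candidate by how much the small learner's loss exceeds
    a reference model's loss: high scores mark examples that are still
    learnable but not yet learned by the learner.
    """
    scores = np.asarray(learner_losses) - np.asarray(reference_losses)
    # Take the top-scoring candidates as the next training batch.
    return np.argsort(scores)[::-1][:batch_size]
```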
arXiv Detail & Related papers (2024-11-27T18:50:15Z) - Visual Data Diagnosis and Debiasing with Concept Graphs [50.84781894621378]
We present ConBias, a framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets.
We show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks.
arXiv Detail & Related papers (2024-09-26T16:59:01Z) - Learning Multi-Aspect Item Palette: A Semantic Tokenization Framework for Generative Recommendation [55.99632509895994]
We introduce LAMIA, a novel approach for multi-aspect semantic tokenization. Unlike RQ-VAE, which uses a single embedding, LAMIA learns an "item palette": a collection of independent and semantically parallel embeddings. Our results demonstrate significant improvements in recommendation accuracy over existing methods.
arXiv Detail & Related papers (2024-09-11T13:49:48Z) - Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts [8.028021897214238]
"OpenCBM" is the first CBM with open-vocabulary concepts.
Our model significantly outperforms the previous state-of-the-art CBM by 9% in classification accuracy on the benchmark dataset CUB-200-2011.
arXiv Detail & Related papers (2024-08-05T06:42:00Z) - Variational Information Pursuit with Large Language and Multimodal Models for Interpretable Predictions [9.07837207208113]
Variational Information Pursuit (V-IP) is a framework for making interpretable predictions by design.
Applying V-IP to any task requires data samples with dense concept-labeling by domain experts.
We extend the V-IP framework with Foundational Models (FMs) to address this limitation.
arXiv Detail & Related papers (2023-08-24T05:04:10Z) - Semi-supervised multi-view concept decomposition [30.699496411869834]
Concept Factorization (CF) has demonstrated superior performance in multi-view clustering tasks.
We propose a novel semi-supervised multi-view concept factorization model, named SMVCF.
We conduct experiments on four diverse datasets to evaluate the performance of SMVCF.
arXiv Detail & Related papers (2023-07-03T10:50:44Z) - Towards Explainable Collaborative Filtering with Taste Clusters Learning [43.4512681951459]
Collaborative Filtering (CF) is a widely used and effective technique for recommender systems.
Adding explainability to recommendation models can not only increase trust in the decision-making process, but also bring multiple other benefits.
We propose a neat and effective Explainable Collaborative Filtering (ECF) model that leverages interpretable cluster learning.
arXiv Detail & Related papers (2023-04-27T03:08:15Z) - Efficient Data-specific Model Search for Collaborative Filtering [56.60519991956558]
Collaborative filtering (CF) is a fundamental approach for recommender systems.
In this paper, motivated by the recent advances in automated machine learning (AutoML), we propose to design a data-specific CF model.
The key is a new framework that unifies state-of-the-art (SOTA) CF methods and splits them into disjoint stages: input encoding, embedding function, interaction, and prediction function.
arXiv Detail & Related papers (2021-06-14T14:30:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.