Concept-Aware Batch Sampling Improves Language-Image Pretraining
- URL: http://arxiv.org/abs/2511.20643v1
- Date: Tue, 25 Nov 2025 18:58:07 GMT
- Title: Concept-Aware Batch Sampling Improves Language-Image Pretraining
- Authors: Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge
- Abstract summary: Concept-Aware Batch Sampling (CABS) is a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly. We show that CABS significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms.
- Score: 78.53540190580189
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
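The two batch-construction variants described in the abstract can be illustrated with a minimal, hypothetical sketch. The function name `sample_batch`, the greedy selection rule, and the candidate-pool heuristic are all assumptions made for illustration; the paper's actual CABS-DM/CABS-FM implementations may differ.

```python
import random
from collections import Counter

def sample_batch(pool, batch_size, mode="diversity", candidate_factor=4):
    """Greedy concept-aware batch sampling (illustrative sketch only).

    pool: list of (sample_id, concepts) pairs, where concepts is a list
          of concept labels annotated for that image-text pair.
    mode: "diversity" favors broad concept coverage (cf. CABS-DM);
          "frequency" favors samples with many concepts (cf. CABS-FM).
    """
    # Score a random candidate subset rather than the full dataset,
    # mimicking on-the-fly (online) batch construction.
    candidates = random.sample(pool, min(len(pool), batch_size * candidate_factor))
    batch, covered = [], Counter()
    for _ in range(batch_size):
        if mode == "diversity":
            # Pick the candidate adding the most not-yet-covered concepts.
            best = max(candidates, key=lambda s: sum(1 for c in s[1] if covered[c] == 0))
        else:
            # Pick the candidate with the highest concept multiplicity.
            best = max(candidates, key=lambda s: len(s[1]))
        candidates.remove(best)
        batch.append(best)
        covered.update(best[1])
    return batch
```

A custom target concept distribution, as the abstract suggests, could replace either scoring rule with a weighted match against the practitioner's desired distribution.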
Related papers
- Unified Multi-Dataset Training for TBPS [7.745213180689951]
Existing TBPS methods rely on dataset-centric fine-tuning to handle distribution shift. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities. We propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework.
arXiv Detail & Related papers (2026-01-21T13:26:28Z) - Online-PVLM: Advancing Personalized VLMs with Online Concept Learning [19.46716778297505]
Online-PVLM is a framework for online concept learning that leverages hyperbolic representations. We develop OP-Eval, a benchmark comprising 1,292 concepts and over 30K high-quality instances with diverse question types.
arXiv Detail & Related papers (2025-11-25T08:25:30Z) - Interpretable Reward Modeling with Active Concept Bottlenecks [54.00085739303773]
We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. We formalize an active learning strategy that dynamically acquires the most informative concept labels.
arXiv Detail & Related papers (2025-07-07T06:26:04Z) - Offline Learning for Combinatorial Multi-armed Bandits [56.96242764723241]
Off-CMAB is the first offline learning framework for CMAB. Off-CMAB combines pessimistic reward estimations with solvers. Experiments on synthetic and real-world datasets highlight the superior performance of CLCB.
arXiv Detail & Related papers (2025-01-31T16:56:18Z) - Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. In this work we explore an alternative, yet simple approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data-, and compute-configurations.
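The snippet above mentions online batch selection but not the scoring rule. A generic, hypothetical sketch of loss-based online selection follows; the function name `select_batch` and the learner-minus-reference "learnability" score are assumptions for illustration, not necessarily the exact ACID criterion from the paper.

```python
import numpy as np

def select_batch(learner_losses, reference_losses, batch_size):
    """Generic loss-based online batch selection (illustrative only).

    Scores each candidate by how much the small learner's loss exceeds
    a reference model's loss: high scores mark examples that are still
    learnable but not yet learned by the learner.
    """
    scores = np.asarray(learner_losses) - np.asarray(reference_losses)
    # Take the top-scoring candidates as the next training batch.
    return np.argsort(scores)[::-1][:batch_size]
```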
arXiv Detail & Related papers (2024-11-27T18:50:15Z) - Visual Data Diagnosis and Debiasing with Concept Graphs [50.84781894621378]
We present ConBias, a framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets.
We show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks.
arXiv Detail & Related papers (2024-09-26T16:59:01Z) - Learning Multi-Aspect Item Palette: A Semantic Tokenization Framework for Generative Recommendation [55.99632509895994]
We introduce LAMIA, a novel approach for multi-aspect semantic tokenization. Unlike RQ-VAE, which uses a single embedding, LAMIA learns an "item palette": a collection of independent and semantically parallel embeddings. Our results demonstrate significant improvements in recommendation accuracy over existing methods.
arXiv Detail & Related papers (2024-09-11T13:49:48Z) - Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts [8.028021897214238]
"OpenCBM" is the first CBM with open-vocabulary concepts.
Our model significantly outperforms the previous state-of-the-art CBM by 9% in classification accuracy on the benchmark dataset CUB-200-2011.
arXiv Detail & Related papers (2024-08-05T06:42:00Z) - Variational Information Pursuit with Large Language and Multimodal Models for Interpretable Predictions [9.07837207208113]
Variational Information Pursuit (V-IP) is a framework for making interpretable predictions by design.
Applying V-IP to any task requires data samples with dense concept-labeling by domain experts.
We extend the V-IP framework with Foundational Models (FMs) to address this limitation.
arXiv Detail & Related papers (2023-08-24T05:04:10Z) - Semi-supervised multi-view concept decomposition [30.699496411869834]
Concept Factorization (CF) has demonstrated superior performance in multi-view clustering tasks.
We propose a novel semi-supervised multi-view concept factorization model, named SMVCF.
We conduct experiments on four diverse datasets to evaluate the performance of SMVCF.
arXiv Detail & Related papers (2023-07-03T10:50:44Z) - Towards Explainable Collaborative Filtering with Taste Clusters Learning [43.4512681951459]
Collaborative Filtering (CF) is a widely used and effective technique for recommender systems.
Adding explainability to recommendation models can not only increase trust in the decision-making process, but also bring multiple other benefits.
We propose a neat and effective Explainable Collaborative Filtering (ECF) model that leverages interpretable cluster learning.
arXiv Detail & Related papers (2023-04-27T03:08:15Z) - Efficient Data-specific Model Search for Collaborative Filtering [56.60519991956558]
Collaborative filtering (CF) is a fundamental approach for recommender systems.
In this paper, motivated by the recent advances in automated machine learning (AutoML), we propose to design a data-specific CF model.
The key is a new framework that unifies state-of-the-art (SOTA) CF methods and splits them into disjoint stages: input encoding, embedding function, interaction, and prediction function.
arXiv Detail & Related papers (2021-06-14T14:30:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.