TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training
- URL: http://arxiv.org/abs/2602.05251v1
- Date: Thu, 05 Feb 2026 03:08:45 GMT
- Title: TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training
- Authors: Guanjie Cheng, Boyi Li, Lingyu Sun, Mengying Zhu, Yangyang Wu, Xinkui Zhao, Shuiguang Deng
- Abstract summary: We introduce TADS (Task-Aware Data Selection), a novel framework for multi-task multimodal pre-training. TADS integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. A feedback-driven meta-learning mechanism adaptively refines the selection strategy based on proxy model performance.
- Score: 29.962039479618543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale multimodal pre-trained models like CLIP rely heavily on high-quality training data, yet raw web-crawled datasets are often noisy, misaligned, and redundant, leading to inefficient training and suboptimal generalization. Existing data selection methods are either heuristic-based, suffering from bias and limited diversity, or data-driven but task-agnostic, failing to optimize for multi-task scenarios. To address these gaps, we introduce TADS (Task-Aware Data Selection), a novel framework for multi-task multimodal pre-training that integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. TADS employs a comprehensive quality assessment system with unimodal and cross-modal operators, quantifies task relevance via interpretable similarity vectors, and optimizes diversity through cluster-based weighting. A feedback-driven meta-learning mechanism adaptively refines the selection strategy based on proxy model performance across multiple downstream tasks. Experiments on CC12M demonstrate that TADS achieves superior zero-shot performance on benchmarks like ImageNet, CIFAR-100, MS-COCO, and Flickr30K, using only 36% of the data while outperforming baselines by an average of 1.0%. This highlights that TADS significantly enhances data efficiency by curating a high-utility subset that yields a much higher performance ceiling within the same computational constraints.
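The abstract names the three signals but not the functional form of the value function. Below is a minimal sketch, assuming a weighted linear combination of per-sample quality, relevance, and diversity scores; in TADS the weights would be refined by the meta-learned feedback loop rather than fixed by hand, and every name and number here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scores for N candidate image-text pairs (all inputs are stand-ins):
# q: intrinsic quality, task_sim: interpretable per-task similarity vector,
# d: distributional diversity from cluster-based weighting.
N, T = 1000, 4
q = rng.uniform(size=N)                    # unimodal/cross-modal quality in [0, 1]
task_sim = rng.uniform(size=(N, T))        # similarity to each downstream task
r = task_sim.max(axis=1)                   # relevance via the best-matching task
cluster = rng.integers(0, 20, size=N)      # assumed cluster assignments
counts = np.bincount(cluster, minlength=20)
d = 1.0 / counts[cluster]                  # rarer clusters weigh more
d = d / d.max()

# Learnable combination weights (fixed here; TADS would refine them from
# proxy-model feedback across downstream tasks).
w = np.array([0.4, 0.4, 0.2])
value = w @ np.stack([q, r, d])

budget = int(0.36 * N)                     # mirror the paper's 36% data budget
selected = np.argsort(value)[-budget:]
print(f"kept {selected.size}/{N} samples")
```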
Related papers
- Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning [49.04912820721943]
Supervised fine-tuning (SFT) is computationally expensive and sometimes suffers from overfitting or bias amplification. This work studies the online batch selection family that dynamically scores and filters samples during the training process. We develop UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT.
arXiv Detail & Related papers (2025-10-19T15:32:01Z)
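The UDS summary above does not give its scoring rule; this toy sketch only conveys the general online batch selection pattern it describes (score each incoming batch, keep a fraction), with utility proxied by training loss and diversity by in-batch feature distance. Both proxies are assumptions, not the UDS formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

def select_online(batch_feats, batch_losses, keep_frac=0.5, alpha=0.7):
    """Score = alpha * utility + (1 - alpha) * diversity, keep the top fraction.

    Utility here is the current training loss (high loss = informative);
    diversity is each sample's mean distance to the rest of the batch.
    """
    utility = (batch_losses - batch_losses.min()) / (np.ptp(batch_losses) + 1e-8)
    dists = np.linalg.norm(batch_feats[:, None] - batch_feats[None, :], axis=-1)
    diversity = dists.mean(axis=1)
    diversity = diversity / (diversity.max() + 1e-8)
    score = alpha * utility + (1 - alpha) * diversity
    k = max(1, int(keep_frac * len(score)))
    return np.argsort(score)[-k:]

feats = rng.normal(size=(64, 16))      # stand-in for model embeddings
losses = rng.uniform(size=64)          # stand-in for per-sample SFT losses
print(select_online(feats, losses)[:5])
```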
- CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization [14.304308878028358]
Multimodal large language models rely heavily on instruction tuning to align vision and language capabilities. Existing data selection methods aim to select important and diverse subsets, but they often suffer from two critical drawbacks. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges.
arXiv Detail & Related papers (2025-10-11T09:41:21Z)
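CoIDO's coupled objective is not specified in the summary; the sketch below shows one standard way to couple importance with diversity, a greedy facility-location-style selection, purely as an assumed stand-in:

```python
import numpy as np

rng = np.random.default_rng(2)

def coupled_select(feats, importance, k, lam=0.5):
    """Greedily pick k samples, coupling importance with diversity:
    marginal gain = lam * importance + (1 - lam) * coverage gain,
    where coverage gain is how much a candidate raises every sample's
    best similarity to the selected set (facility-location style)."""
    sim = feats @ feats.T
    best = np.zeros(len(feats))   # current best similarity to the selected set
    chosen = []
    for _ in range(k):
        gain = lam * importance + (1 - lam) * np.maximum(sim - best, 0.0).mean(axis=1)
        gain[chosen] = -np.inf    # never re-pick a selected sample
        i = int(np.argmax(gain))
        chosen.append(i)
        best = np.maximum(best, sim[i])
    return chosen

feats = rng.normal(size=(200, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
importance = rng.uniform(size=200)    # stand-in importance scores
print(coupled_select(feats, importance, k=30)[:5])
```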
- IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment [29.703775936837012]
Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the mixture training dataset, governed by volumes of data from different domains, is a critical factor that directly impacts the final model's performance. We introduce an innovative data equilibrium framework designed to effectively optimize volumes of data from different domains within mixture SFT datasets.
arXiv Detail & Related papers (2025-05-19T06:42:44Z)
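As a hedged illustration of the data-equilibrium idea above (adapting per-domain data volumes inside a mixture SFT set), the toy update below grows domains with high held-out loss and shrinks the rest under a fixed budget; the multiplicative rule, domain names, and numbers are invented, not IDEAL's:

```python
import numpy as np

def rebalance(volumes, dev_losses, eta=0.5):
    """One equilibrium step: grow domains whose held-out loss is above
    average and shrink the rest, keeping the total budget fixed."""
    losses = np.asarray(dev_losses, dtype=float)
    weights = np.exp(eta * (losses - losses.mean()))
    new = np.asarray(volumes, dtype=float) * weights
    return new * (np.sum(volumes) / new.sum())   # renormalize to the budget

domains = ["math", "code", "chat", "safety"]     # hypothetical domains
volumes = [10_000, 10_000, 10_000, 10_000]
dev_losses = [1.9, 1.2, 1.0, 1.4]                # stand-in per-domain dev losses
for name, v in zip(domains, rebalance(volumes, dev_losses)):
    print(f"{name}: {v:,.0f} samples")
```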
- Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning [40.19639581728674]
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. We propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4× speedup.
arXiv Detail & Related papers (2025-05-18T03:10:00Z)
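Data Whisperer's exact procedure (including its attention weighting) is not in the summary; the mock below only conveys the training-free few-shot idea: repeatedly use candidate samples as in-context demonstrations, grade the model on held-out queries, and average the credit back onto the demonstrations. The model call is stubbed out, so every detail here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

def mock_icl_accuracy(demo_ids, query_ids):
    """Stand-in for running the model with `demo_ids` as few-shot context
    and grading its answers on `query_ids`; a real run would prompt the
    LLM that is about to be fine-tuned. Here: a deterministic mock."""
    local = np.random.default_rng(hash(tuple(sorted(demo_ids))) % (2**32))
    return local.uniform()

def score_candidates(pool_size, query_ids, rounds=50, shots=5):
    """Each round draws a random few-shot demo set; every member shares the
    resulting ICL accuracy. Averaging over rounds gives a per-sample score."""
    totals = np.zeros(pool_size)
    counts = np.zeros(pool_size)
    for _ in range(rounds):
        demos = rng.choice(pool_size, size=shots, replace=False)
        acc = mock_icl_accuracy(demos, query_ids)
        totals[demos] += acc
        counts[demos] += 1
    return totals / np.maximum(counts, 1)

scores = score_candidates(pool_size=100, query_ids=list(range(10)))
print(np.argsort(scores)[-10:])   # ten highest-scoring candidates
```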
- Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness [63.484378941471114]
We propose a collaborative framework, DataTailor, which leverages three key principles (informativeness, uniqueness, and representativeness) for effective data selection. Experiments on various benchmarks demonstrate that DataTailor achieves 101.3% of the performance of full-data fine-tuning with only 15% of the data.
arXiv Detail & Related papers (2024-12-09T08:36:10Z)
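A rough sketch of how DataTailor's three principles might be instantiated from sample embeddings; the specific proxies (nearest-neighbor distance for uniqueness, centroid similarity for representativeness, a random stand-in for informativeness) are assumptions for illustration, not the paper's definitions:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # stand-in sample embeddings

sim = X @ X.T
np.fill_diagonal(sim, -np.inf)
uniqueness = 1.0 - sim.max(axis=1)              # far from the nearest neighbor

centroid = X.mean(axis=0)
centroid /= np.linalg.norm(centroid)
representativeness = X @ centroid               # close to the data centroid

informativeness = rng.uniform(size=len(X))      # stand-in model-based signal

def z(v):
    return (v - v.mean()) / (v.std() + 1e-8)    # put scores on one scale

score = z(informativeness) + z(uniqueness) + z(representativeness)
budget = int(0.15 * len(X))                     # 15% budget, as in the paper
print(np.argsort(score)[-budget:][:10])
```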
- Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts. This data selector samples a subset of the most important samples from each skill cluster for training.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
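The summary above sketches a two-stage recipe: cluster gradient-based sample vectors into pseudo-skills, then let a per-cluster selector pick samples. Below is a toy version with a hand-rolled k-means and two invented selector "experts"; the expert-picking criterion and all details are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
grads = rng.normal(size=(600, 32))   # stand-in gradient-based sample vectors

# Tiny k-means to form pseudo-skill clusters.
k = 6
centers = grads[rng.choice(len(grads), size=k, replace=False)]
for _ in range(10):
    labels = np.argmin(((grads[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.stack([
        grads[labels == c].mean(0) if np.any(labels == c) else centers[c]
        for c in range(k)
    ])

# Two toy selector experts; per cluster, the expert whose scores spread the
# most (a mock proxy for "best-performing") ranks that cluster's samples.
def sel_norm(v):
    return np.linalg.norm(v, axis=1)    # prefer large-gradient samples

def sel_balance(v):
    return -np.abs(v.mean(axis=1))      # prefer sign-balanced samples

subset = []
for c in range(k):
    members = np.where(labels == c)[0]
    expert = max((sel_norm, sel_balance), key=lambda f: f(grads[members]).std())
    subset.extend(members[np.argsort(expert(grads[members]))[-20:]])
print(len(subset), "samples selected across", k, "clusters")
```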
- AdapMTL: Adaptive Pruning Framework for Multitask Learning Model [5.643658120200373]
AdapMTL is an adaptive pruning framework for multitask models. It balances sparsity allocation and accuracy across multiple tasks, and shows superior performance compared to state-of-the-art pruning methods.
arXiv Detail & Related papers (2024-08-07T17:19:15Z)
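AdapMTL's allocation rule is not given in the summary; as a hedged sketch, the heuristic below assigns lower sparsity to components estimated as more sensitive while keeping the average at a global target. Component names and sensitivity values are hypothetical:

```python
import numpy as np

def adaptive_sparsity(sensitivities, global_sparsity=0.7):
    """Allocate lower sparsity to more sensitive components so that the
    average sparsity still hits the global target (a softmax heuristic)."""
    s = np.asarray(sensitivities, dtype=float)
    keep = np.exp(s) / np.exp(s).sum()          # sensitive parts keep more
    keep = keep * (1 - global_sparsity) * len(s)
    return np.clip(1 - keep, 0, 0.99)

# Hypothetical components: shared backbone plus two task-specific heads.
names = ["backbone", "head_taskA", "head_taskB"]
sens = [0.9, 0.4, 0.2]   # stand-in sensitivity estimates
for n, sp in zip(names, adaptive_sparsity(sens)):
    print(f"{n}: prune {sp:.0%} of weights")
```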
- Multi-Teacher Multi-Objective Meta-Learning for Zero-Shot Hyperspectral Band Selection [50.30291173608449]
We propose a novel multi-objective meta-learning network (M$^3$BS) for zero-shot hyperspectral band selection.
In M$^3$BS, a generalizable graph convolution network (GCN) is constructed to generate a dataset-agnostic base.
The acquired meta-knowledge can be directly transferred to unseen datasets without any retraining or fine-tuning.
arXiv Detail & Related papers (2024-06-12T07:13:31Z)
- Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)