Collaborative Unlabeled Data Optimization
- URL: http://arxiv.org/abs/2505.14117v1
- Date: Tue, 20 May 2025 09:21:40 GMT
- Title: Collaborative Unlabeled Data Optimization
- Authors: Xinyi Shang, Peng Sun, Fengyuan Liu, Tao Lin
- Abstract summary: This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines.
- Score: 6.512302544770766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations in existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94 \times $ and $1.2 \times$.
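The abstract frames the approach as distributing unlabeled data across workers, each equipped with a publicly available task-agnostic model, and writing the extracted knowledge back into the data so downstream training can reuse it. A minimal sketch of that idea follows; the choice of pretrained torchvision backbones as "task-agnostic models", the per-shard `optimize_shard` step, and the soft-target distillation at the end are illustrative assumptions, not the paper's actual pipeline.
```python
# Minimal sketch of collaborative unlabeled-data optimization (assumptions noted in comments).
# Idea from the abstract: partition unlabeled data across workers, each holding a publicly
# available task-agnostic model, and encode the extracted knowledge into the data itself
# (here: cached soft targets), so the result is reusable without re-running the source models.
import torch
import torchvision.models as models

# Hypothetical pool of task-agnostic encoders (any publicly available pretrained backbones).
encoders = [
    models.resnet18(weights=models.ResNet18_Weights.DEFAULT),
    models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT),
]
for enc in encoders:
    enc.eval()

def optimize_shard(shard: torch.Tensor, encoder: torch.nn.Module) -> dict:
    """Encode knowledge into a data shard: attach the encoder's output to each sample."""
    with torch.no_grad():
        reps = encoder(shard)  # task-agnostic representation (logits here, for simplicity)
    return {"images": shard, "targets": reps.softmax(dim=-1)}  # reusable "optimized" shard

# Unlabeled data split into as many shards as workers; each worker uses a different encoder.
unlabeled = torch.randn(8, 3, 224, 224)  # stand-in for an unlabeled image batch
shards = torch.chunk(unlabeled, len(encoders))
optimized = [optimize_shard(s, e) for s, e in zip(shards, encoders)]

# A downstream student can now train on (images, targets) pairs without labels,
# e.g. by distilling the cached soft targets.
student = models.resnet18(num_classes=1000)
loss = torch.nn.functional.cross_entropy(
    student(optimized[0]["images"]), optimized[0]["targets"]
)
```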
Related papers
- Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning [40.19639581728674]
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. We propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Data Whisperer outperforms fine-tuning on the full GSM8K dataset with the Llama-3-8B-Instruct model while using just 10% of the data, and surpasses existing methods with a 3.1-point improvement and a 7.4$\times$ speedup.
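The summary describes a training-free, attention-based selection procedure built on few-shot in-context learning. The sketch below illustrates one way such scoring could look: candidates are placed in a single prompt and ranked by the attention the query tokens pay to each candidate's span. GPT-2 as a stand-in model, the piecewise span bookkeeping, and the exact scoring rule are assumptions, not Data Whisperer's actual procedure.
```python
# Sketch of training-free, attention-based data selection with few-shot in-context learning:
# place candidate examples in one prompt, run the frozen model once, and score each candidate
# by the attention the query tokens pay to its span. GPT-2 and this scoring rule are
# illustrative assumptions, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

candidates = ["Q: 2+2? A: 4", "Q: capital of France? A: Paris", "Q: 3*3? A: 9"]
query = "Q: 5+7? A:"

# Build one prompt and remember each candidate's token span.
# Spans are computed piecewise; BPE boundaries may shift slightly, which is fine for a sketch.
prompt, spans, cursor = "", [], 0
for c in candidates:
    piece = c + "\n"
    n_tokens = len(tok(piece, add_special_tokens=False)["input_ids"])
    spans.append((cursor, cursor + n_tokens))
    prompt += piece
    cursor += n_tokens
prompt += query

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Average attention over layers and heads, then sum attention from the query's tokens
# to each candidate's span.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # (seq_len, seq_len)
query_rows = attn[cursor:]                               # rows belonging to the query tokens
scores = [query_rows[:, s:e].sum().item() for s, e in spans]
ranked = sorted(zip(scores, candidates), reverse=True)
print(ranked)  # higher score = candidate attended to more by the query
```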
arXiv Detail & Related papers (2025-05-18T03:10:00Z) - Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation [4.030723722142048]
This paper tackles the challenges associated with the unstructured and heterogeneous nature of webcrawl datasets. We introduce an advanced, learning-driven approach, Ensemble Curation Of DAta ThroUgh Multimodal Operators (EcoDatum). EcoDatum strategically integrates various unimodal and multimodal data curation operators within a weak supervision ensemble framework. It ranked 1st on the DataComp leaderboard, with an average performance score of 0.182 across 38 diverse evaluation datasets.
arXiv Detail & Related papers (2025-02-12T08:40:57Z) - Data Assetization via Resources-decoupled Federated Learning [7.347554648348435]
Federated learning (FL) provides an effective approach to collaborative model training while preserving privacy. We first propose a framework for resource-decoupled FL involving three parties. Next, we propose the Quality-aware Dynamic Resources-decoupled FL algorithm (QD-RDFL).
arXiv Detail & Related papers (2025-01-24T15:49:04Z) - YuLan-Mini: An Open Data-efficient Language Model [111.02822724500552]
YuLan-Mini, a highly capable base model with 2.42B parameters, achieves top-tier performance among models of similar parameter scale. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data.
arXiv Detail & Related papers (2024-12-23T17:47:53Z) - Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z) - Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness [65.01625761120924]
We argue that a valuable sample should be informative of the task, non-redundant, and represent the sample distribution (i.e., not an outlier). We propose a collaborative framework, DataTailor, which leverages three key principles (informativeness, uniqueness, and representativeness) for effective data selection. Experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data.
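The three principles can be turned into a simple combined score; the concrete choices below (per-sample loss for informativeness, nearest-neighbor distance for uniqueness, centroid similarity for representativeness) are illustrative stand-ins, not DataTailor's actual criteria.
```python
# Illustrative sketch of selecting samples by informativeness, uniqueness, and
# representativeness. The scoring functions are assumptions for illustration only.
import numpy as np

def select(features: np.ndarray, losses: np.ndarray, k: int, weights=(1.0, 1.0, 1.0)):
    # Informativeness: higher per-sample loss -> more informative (a stand-in choice).
    informativeness = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)

    # Uniqueness: distance to the nearest other sample; low for near-duplicates.
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    uniqueness = dists.min(axis=1)
    uniqueness = uniqueness / (uniqueness.max() + 1e-8)

    # Representativeness: cosine similarity to the dataset centroid; low for outliers.
    centroid = features.mean(axis=0)
    representativeness = features @ centroid / (
        np.linalg.norm(features, axis=1) * np.linalg.norm(centroid) + 1e-8
    )

    score = (weights[0] * informativeness
             + weights[1] * uniqueness
             + weights[2] * representativeness)
    return np.argsort(score)[-k:]  # indices of the k highest-scoring samples

# Usage: select 15% of a toy feature set (mirroring the 15% ratio reported above).
feats, losses = np.random.randn(200, 32), np.random.rand(200)
subset = select(feats, losses, k=30)
```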
arXiv Detail & Related papers (2024-12-09T08:36:10Z) - DataSculpt: Crafting Data Landscapes for Long-Context LLMs through Multi-Objective Partitioning [32.914155560286225]
The key to improving long-context performance lies in effective data organization and management strategies.
We introduce DataSculpt, a novel data management framework designed for long-context training.
Our evaluations demonstrate that DataSculpt significantly enhances long-context training performance.
arXiv Detail & Related papers (2024-09-02T07:23:13Z) - Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective.
This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
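The core idea, fitting a cheap amortized model to noisy but unbiased estimates of an expensive per-example quantity, can be sketched in a few lines; the small MLP and the synthetic noisy targets below are assumptions for illustration.
```python
# Sketch of stochastic amortization: fit a cheap model to noisy, unbiased estimates of an
# expensive per-example quantity (e.g., a data-valuation or attribution score), then use the
# fitted model instead of re-running the expensive estimator.
import torch
import torch.nn as nn

n, d = 1024, 16
features = torch.randn(n, d)
true_value = features[:, 0] * 2.0 - features[:, 1]   # stand-in for the "expensive" score
noisy_labels = true_value + 0.5 * torch.randn(n)     # cheap, noisy, unbiased estimates

amortized = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(amortized.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(amortized(features).squeeze(-1), noisy_labels)
    loss.backward()
    opt.step()

# With unbiased label noise, the regression target equals the true expectation, so the
# amortized model approximates the expensive score at a fraction of the cost.
print(nn.functional.mse_loss(amortized(features).squeeze(-1), true_value).item())
```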
arXiv Detail & Related papers (2024-01-29T03:42:37Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
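Distribution matching for condensation can be sketched as optimizing a small synthetic set so its embedding statistics match those of real data under randomly sampled embedders; the random MLP embedder and plain mean-matching loss below are simplifications, not the paper's improved formulation.
```python
# Sketch of dataset condensation by distribution matching: learn a tiny synthetic set whose
# embedding statistics match the real data's under freshly sampled random embedders.
import torch
import torch.nn as nn

real = torch.randn(1000, 32)                          # stand-in for real features of one class
synthetic = torch.randn(10, 32, requires_grad=True)   # condensed set to be optimized
opt = torch.optim.SGD([synthetic], lr=0.1)

for step in range(300):
    embed = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # freshly sampled random embedder
    loss = (embed(real).mean(0) - embed(synthetic).mean(0)).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
# `synthetic` can now serve as a tiny training set that mimics the real distribution.
```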
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing [57.86954315102865]
DeepSpeed Data Efficiency is a framework that makes better use of data, increases training efficiency, and improves model quality.
For GPT-3 1.3B language model pretraining, our work uses 12.5x less data/time/cost while still maintaining 95% of model quality relative to the full-data baseline.
For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or better model quality under the same data/time/cost.
arXiv Detail & Related papers (2022-12-07T12:27:28Z) - Infinite Recommendation Networks: A Data-Centric Approach [8.044430277912936]
We leverage the Neural Tangent Kernel to train infinitely-wide neural networks to devise $\infty$-AE: an autoencoder with infinitely-wide bottleneck layers.
We also develop Distill-CF for synthesizing tiny, high-fidelity data summaries.
We observe 96-105% of $\infty$-AE's performance on the full dataset with as little as 0.1% of the original dataset size.
arXiv Detail & Related papers (2022-06-03T00:34:13Z) - FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server [64.94942635929284]
Federated Learning (FL) suffers from two critical challenges, i.e., limited computational resources and low training efficiency.
We propose a novel FL framework, FedDUAP, to exploit the insensitive data on the server and the decentralized data in edge devices.
By integrating these two original techniques, the proposed FL framework, FedDUAP, significantly outperforms baseline approaches in terms of accuracy (up to 4.8% higher), efficiency (up to 2.8 times faster), and computational cost (up to 61.9% smaller).
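The two ingredients named here, a server-side update that exploits data shared with the server and adaptive pruning to cut computation, can be sketched on top of FedAvg-style aggregation; the toy model, equal aggregation weights, and fixed pruning ratio below are illustrative assumptions, not FedDUAP's actual algorithm.
```python
# Sketch: FedAvg aggregation plus (1) a server-side update on data shared with the server
# and (2) magnitude pruning to reduce computation. Details are simplifying assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def local_update(model, data, targets, lr=0.01, epochs=1):
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(model(data), targets).backward()
        opt.step()
    return model.state_dict()

def fedavg(states):
    avg = copy.deepcopy(states[0])
    for k in avg:
        avg[k] = torch.stack([s[k].float() for s in states]).mean(0)
    return avg

global_model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
clients = [(torch.randn(32, 20), torch.randint(0, 5, (32,))) for _ in range(4)]
server_data = (torch.randn(64, 20), torch.randint(0, 5, (64,)))  # data shared on the server

for round_ in range(3):
    states = [local_update(global_model, x, y) for x, y in clients]
    global_model.load_state_dict(fedavg(states))
    # Server-side update on the shared data (the "dynamic update" idea, simplified).
    global_model.load_state_dict(local_update(global_model, *server_data))
    # Pruning, here as fixed-ratio L1 magnitude pruning of linear layers.
    for module in global_model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.2)
            prune.remove(module, "weight")
```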
arXiv Detail & Related papers (2022-04-25T10:00:00Z)