A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices
- URL: http://arxiv.org/abs/2505.16563v2
- Date: Tue, 10 Jun 2025 09:20:49 GMT
- Title: A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices
- Authors: Chen Gong, Rui Xing, Zhenzhe Zheng, Fan Wu
- Abstract summary: Current on-device model training is hampered by low training throughput, limited storage, and diverse data importance. We propose a two-stage data selection framework, Titan, to select the most important data batches from streaming data for model training. Titan achieves up to a 43% reduction in training time and a 6.2% increase in final accuracy with minor system overhead.
- Score: 18.853357902416832
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The demand for machine learning (ML) model training on edge devices is escalating due to data privacy and personalized service needs. However, we observe that current on-device model training is hampered by under-utilization of on-device data, owing to low training throughput, limited storage, and diverse data importance. To improve data resource utilization, we propose a two-stage data selection framework, Titan, which selects the most important data batch from streaming data for model training with guaranteed efficiency and effectiveness. In the first stage, Titan filters out a candidate dataset with potentially high importance in a coarse-grained manner. In the second, fine-grained stage, we propose a theoretically optimal data selection strategy to identify the data batch that yields the highest model performance improvement for the current training round. To further enhance time and resource efficiency, Titan uses a pipeline to co-execute data selection and model training, avoiding resource conflicts by exploiting idle computing resources. We evaluate Titan on real-world edge devices and three representative edge computing tasks with diverse models and data modalities. Empirical results demonstrate that Titan achieves up to a 43% reduction in training time and a 6.2% increase in final accuracy with minor system overhead in terms of data processing delay, memory footprint, and energy consumption.
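To make the coarse-to-fine idea concrete, below is a minimal, hypothetical sketch of a two-stage selection loop in PyTorch: a cheap per-sample loss proxy filters the stream into a small candidate pool, and a gradient-norm score then picks the batch for the next update. The function names, the loss-based proxy, and the gradient-norm utility are illustrative assumptions rather than the paper's actual criteria, and Titan's pipelined co-execution with training on idle compute is omitted entirely.

```python
# Illustrative sketch of a two-stage data selection loop (not the authors' code).
import torch
import torch.nn.functional as F

def coarse_filter(model, stream, pool_size, device="cpu"):
    """Stage 1: keep the pool_size streaming samples with the largest loss (cheap proxy)."""
    scored = []
    model.eval()
    with torch.no_grad():
        for x, y in stream:
            y = torch.as_tensor(y).view(1).to(device)
            loss = F.cross_entropy(model(x.unsqueeze(0).to(device)), y)
            scored.append((loss.item(), x, y.squeeze(0).cpu()))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:pool_size]

def fine_select(model, pool, batch_size, device="cpu"):
    """Stage 2: pick the candidate batch with the largest summed gradient norm
    (one of many possible utility estimates)."""
    best_batch, best_score = None, float("-inf")
    for start in range(0, len(pool) - batch_size + 1, batch_size):
        chunk = pool[start:start + batch_size]
        x = torch.stack([c[1] for c in chunk]).to(device)
        y = torch.stack([c[2] for c in chunk]).to(device)
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        score = sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)
        if score > best_score:
            best_score, best_batch = score, (x, y)
    model.zero_grad()
    return best_batch
```

In Titan itself the fine-grained stage uses a theoretically optimal selection rule; the sketch only conveys the two-stage structure of filtering a pool and then choosing one batch from it.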
Related papers
- LLM Data Selection and Utilization via Dynamic Bi-level Optimization [100.20933466418786]
We propose a new Data Weighting Model (DWM) that adjusts the weight of selected data within each batch to achieve dynamic data utilization during training. Our experiments demonstrate that DWM enhances the performance of models trained with randomly selected data. We further analyze how a model's data preferences evolve throughout training, providing new insights into those preferences.
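As a rough illustration of per-batch data weighting, the sketch below maps per-sample losses through a small weighting network to normalized weights in the training objective. The `BatchWeighter` module and the loss-as-input design are assumptions made for illustration; the bi-level optimization that trains the weighting model itself is omitted.

```python
# Hypothetical per-sample batch reweighting sketch (not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BatchWeighter(nn.Module):
    """Tiny network mapping per-sample losses to normalized batch weights."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, per_sample_loss):          # (B,) -> (B,), weights sum to 1
        scores = self.net(per_sample_loss.unsqueeze(-1)).squeeze(-1)
        return torch.softmax(scores, dim=0)

def weighted_step(model, weighter, optimizer, x, y):
    """One training step with dynamically weighted per-sample losses."""
    losses = F.cross_entropy(model(x), y, reduction="none")   # (B,)
    weights = weighter(losses.detach())   # weights computed from detached losses
    loss = (weights * losses).sum()       # the weighter's own (bi-level) update is omitted
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```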
arXiv Detail & Related papers (2025-07-22T02:47:12Z) - Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning [40.19639581728674]
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. We propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Data Whisperer achieves performance superior to fine-tuning on the full GSM8K dataset with the Llama-3-8B-Instruct model while using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4x speedup.
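A heavily simplified way to picture training-free, in-context data scoring is sketched below: each candidate example is used as a one-shot demonstration and ranked by how well the frozen model then predicts the answers of a few probe queries. This likelihood-based scoring is an assumption for illustration; the actual method relies on attention-based scores, which this sketch does not implement.

```python
# Simplified in-context usefulness scoring sketch (not the Data Whisperer algorithm).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def icl_score(model, tokenizer, demo, probes, device="cpu"):
    """Average negative answer loss over probe (question, answer) pairs,
    with `demo` prepended as a one-shot demonstration (higher is better)."""
    model.eval()
    total = 0.0
    for question, answer in probes:
        prompt = f"{demo}\n\nQ: {question}\nA:"
        ids = tokenizer(prompt + " " + answer, return_tensors="pt").input_ids.to(device)
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        labels = ids.clone()
        labels[:, :prompt_len] = -100            # score only the answer tokens
        with torch.no_grad():
            total -= model(input_ids=ids, labels=labels).loss.item()
    return total / len(probes)

# Hypothetical usage: rank candidates by score and keep the top 10%.
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# ranked = sorted(candidates, key=lambda d: icl_score(model, tokenizer, d, probes), reverse=True)
```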
arXiv Detail & Related papers (2025-05-18T03:10:00Z) - Federated Learning with Workload Reduction through Partial Training of Client Models and Entropy-Based Data Selection [3.9981390090442694]
We propose FedFT-EDS, a novel approach that combines fine-tuning of partial client models with entropy-based data selection to reduce training workloads on edge devices. Our experiments show that FedFT-EDS uses only 50% of user data while improving global model performance compared to the baseline methods FedAvg and FedProx. FedFT-EDS improves client learning efficiency by up to 3 times, using one third of the training time on clients to achieve performance equivalent to the baselines.
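A minimal sketch of entropy-based data selection on a client is given below: compute each sample's predictive entropy under the current model and keep the most uncertain fraction for local training. The function name and the keep-fraction parameter are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of entropy-based data selection on a client device.
import torch
import torch.nn.functional as F

def select_by_entropy(model, dataset, keep_fraction=0.5, device="cpu"):
    """Return indices of the keep_fraction samples with the highest predictive entropy."""
    model.eval()
    entropies = []
    with torch.no_grad():
        for x, _ in dataset:
            probs = F.softmax(model(x.unsqueeze(0).to(device)), dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
            entropies.append(entropy.item())
    k = max(1, int(keep_fraction * len(entropies)))
    return sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)[:k]
```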
arXiv Detail & Related papers (2024-12-30T22:47:32Z) - YuLan-Mini: An Open Data-efficient Language Model [111.02822724500552]
YuLan-Mini, a highly capable base model with 2.42B parameters, achieves top-tier performance among models of similar parameter scale. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data.
arXiv Detail & Related papers (2024-12-23T17:47:53Z) - TextGram: Towards a better domain-adaptive pretraining [0.3769303106863454]
In NLP, pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks.
We propose our own domain-adaptive data selection method - TextGram.
We show that the proposed strategy works better than other selection methods.
arXiv Detail & Related papers (2024-04-28T15:44:57Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
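As a sketch of the online mixing idea, the code below treats each data group as a bandit arm, samples the next batch's domain from a learned mixture, and updates the mixture with the observed training loss as the reward signal. This is a generic multiplicative-weights (EXP3-style) variant under assumed hyperparameters, not the paper's exact update rule.

```python
# Generic online data-mixing sketch: bandit-style mixture over data groups.
import math
import random

class OnlineMixer:
    """Multiplicative-weights mixture over data groups (domains)."""
    def __init__(self, num_domains, lr=0.1):
        self.log_w = [0.0] * num_domains
        self.lr = lr

    def probs(self):
        m = max(self.log_w)                       # stabilized softmax over log-weights
        exp_w = [math.exp(w - m) for w in self.log_w]
        z = sum(exp_w)
        return [w / z for w in exp_w]

    def sample_domain(self):
        return random.choices(range(len(self.log_w)), weights=self.probs())[0]

    def update(self, domain, reward):
        # importance-weighted multiplicative update (EXP3-style)
        p = self.probs()[domain]
        self.log_w[domain] += self.lr * reward / max(p, 1e-8)
```

In a training loop one would call `sample_domain()` to pick the group for the next batch and feed that batch's loss back via `update()`, so that groups currently yielding higher loss are sampled more often.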
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - How to Train an Accurate and Efficient Object Detection Model on Any
Dataset [0.0]
We propose a dataset-agnostic template for object detection trainings.
It consists of carefully chosen and pre-trained models together with a robust training pipeline for further training.
Our solution works out-of-the-box and provides a strong baseline on a wide range of datasets.
arXiv Detail & Related papers (2022-11-30T17:09:01Z) - Knowledge Distillation as Efficient Pre-training: Faster Convergence,
Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z) - Deep Reinforcement Learning Assisted Federated Learning Algorithm for
Data Management of IIoT [82.33080550378068]
The continuously expanding scale of the industrial Internet of Things (IIoT) leads to IIoT equipment generating massive amounts of user data at every moment.
How to manage these time series data in an efficient and safe way in the field of IIoT is still an open issue.
This paper studies applications of federated learning (FL) technology for managing IIoT equipment data in wireless network environments.
arXiv Detail & Related papers (2022-02-03T07:12:36Z) - Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve standard ImageNet pre-training by 1-3% by tuning available models on our subsets and pre-training on a dataset filtered from a larger scale dataset.
arXiv Detail & Related papers (2020-11-20T06:16:15Z)