Related papers: LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models

LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models

URL: http://arxiv.org/abs/2411.11289v1
Date: Mon, 18 Nov 2024 05:17:27 GMT
Title: LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models
Authors: Yungi Kim, Hyunsoo Ha, Seonghoon Yang, Sukyung Lee, Jihoo Kim, Chanjun Park,
Abstract summary: We introduce the Lightweight, Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs. Based on our four core principles, the LP Data Pipeline significantly reduces preparation time and cost while maintaining high data quality.
Score: 2.060383637820238
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits accessibility for organizations lacking significant computational infrastructure. To address this issue, we introduce the Lightweight, Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs to streamline the processes of dataset extraction, filtering, and curation. Based on our four core principles, the LP Data Pipeline significantly reduces preparation time and cost while maintaining high data quality. Importantly, our pipeline enables the creation of purpose-driven datasets tailored to specific domains and languages, enhancing the applicability of LLMs in specialized contexts. We anticipate that our pipeline will lower the barriers to LLM development, enabling a wide range of organizations to access LLMs more easily.

Related papers

Are Large Language Models Good Data Preprocessors? [5.954202581988127]
High-quality textual training data is essential for the success of multimodal data processing tasks. outputs from image captioning models like BLIP and GIT often contain errors and anomalies that are difficult to rectify using rule-based methods.
arXiv Detail & Related papers (2025-02-24T02:57:21Z)
LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
Role-RL: Online Long-Context Processing with Role Reinforcement Learning for Distinct LLMs in Their Optimal Roles [13.64363652226897]
Large language models (LLMs) with long-context processing are still challenging because of their implementation complexity, training efficiency and data sparsity. Online Long-context Processing (OLP) is proposed when we process a document of unlimited length, which typically occurs in the information reception and organization of diverse streaming media such as automated news reporting, live e-commerce, and viral short videos. We also develop Role Reinforcement Learning (Role-RL) to automatically deploy different LLMs in their respective roles within the OLP pipeline according to their actual performance.
arXiv Detail & Related papers (2024-09-26T16:22:59Z)
LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs [11.664088080448593]
"LlamaDuo" is a pipeline for migrating knowledge and abilities from service-oriented large language models to smaller, locally manageable models. Our pipeline is crucial for ensuring service continuity in the presence of operational failures, strict privacy policies, or offline requirements.
arXiv Detail & Related papers (2024-08-24T05:03:08Z)
Automated Federated Pipeline for Parameter-Efficient Fine-Tuning of Large Language Models [21.864109456867784]
For many downstream tasks, it is necessary to fine-tune large language models (LLMs) using private data. We propose an automated federated pipeline, named FedPipe, to fine-tune LLMs with minimal training cost. Extensive experiments demonstrate that FedPipe expedites the model training and achieves higher accuracy than state-of-the-art benchmarks.
arXiv Detail & Related papers (2024-04-09T16:50:30Z)
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime. We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses these challenges of federated fine-tuning LLMs, and introduces our package FS-LLM as a main contribution. We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models [11.845239346943067]
parameter-efficient fine-tuning (PEFT) is a promising approach to efficiently specialize large language models (LLMs) to task-specific data. Our study highlights the potential for tuning larger LLMs and significant reductions in memory usage by combining PEFT with quantization.
arXiv Detail & Related papers (2023-08-21T04:31:06Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines. We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.