Jellyfish: A Large Language Model for Data Preprocessing
- URL: http://arxiv.org/abs/2312.01678v6
- Date: Mon, 28 Oct 2024 22:24:01 GMT
- Title: Jellyfish: A Large Language Model for Data Preprocessing
- Authors: Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
- Abstract summary: We consider instruction-tuning local LLMs (7 -- 13B models) as universal DP task solvers that operate on a local, single, and low-priced GPU.
By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, namely, Jellyfish-7B/8B/13B, deliver competitiveness compared to GPT-3.5/4 models.
- Score: 9.99065004972981
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7 -- 13B models) as universal DP task solvers that operate on a local, single, and low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction tuning data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, namely, Jellyfish-7B/8B/13B, deliver competitiveness compared to GPT-3.5/4 models and strong generalizability to unseen tasks while barely compromising the base models' abilities in NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared to GPT-3.5. Our models are available at: https://huggingface.co/NECOUDBFM/Jellyfish . Our instruction dataset is available at: https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct .
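Since the models are public on Hugging Face, a quick way to try one on a DP task is a plain causal-LM generation call. Below is a minimal sketch assuming the standard transformers API; the entity-matching prompt wording is illustrative, not the paper's official template (see the model card for the recommended format):

```python
# Minimal sketch: querying Jellyfish for a data-preprocessing task via the
# standard Hugging Face causal-LM API. The entity-matching prompt below is
# illustrative; consult the model card for the recommended prompt template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish"  # repository named in the abstract
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "You are tasked with entity matching.\n"
    "Record A: [name: iPhone 14 Pro, brand: Apple]\n"
    "Record B: [name: Apple iPhone14 Pro, brand: Apple]\n"
    "Are Record A and Record B the same real-world entity? Answer Yes or No."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```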
Related papers
- DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning [22.704995231753397]
Ad-hoc instruction fine-tuning of large language models (LLMs) is widely adopted for domain-specific adaptation.
We propose DONOD, a lightweight model-intrinsic data pruning method.
By filtering out 70% of the full dataset, we improve target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%.
arXiv Detail & Related papers (2025-04-21T02:25:03Z)
- The Best Instruction-Tuning Data are Those That Fit [17.401088816596054]
Supervised fine-tuning (SFT) data are crucial for eliciting strong capabilities from pretrained large language models (LLMs).
We propose **GRAPE**, a novel SFT framework that accounts for the unique characteristics of the target model.
For each instruction, it gathers responses from various LLMs and selects the one with the highest probability measured by the target model.
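That selection rule reduces to scoring each candidate by its response log-probability under the target model and keeping the argmax. A minimal sketch of the scoring step, assuming plain instruction and response strings (the checkpoint here is an arbitrary example, not one named by the paper):

```python
# Sketch of GRAPE-style selection: among candidate responses to one
# instruction, keep the response the *target* model assigns the highest
# log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # example target model to be fine-tuned
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def response_logprob(instruction: str, response: str) -> float:
    """Sum log p(response | instruction), token by token (approximate at the
    prompt/response boundary, since the strings are tokenized jointly)."""
    n_prompt = tok(instruction, return_tensors="pt").input_ids.shape[1]
    ids = tok(instruction + response, return_tensors="pt").input_ids.to(lm.device)
    logits = lm(ids).logits[0, :-1]              # position i predicts token i+1
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_lp = logps.gather(1, ids[0, 1:, None]).squeeze(1)
    return token_lp[n_prompt - 1:].sum().item()  # response tokens only

def select_response(instruction: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda r: response_logprob(instruction, r))
```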
arXiv Detail & Related papers (2025-02-06T16:31:21Z)
- CoddLLM: Empowering Large Language Models for Data Analytics [38.23203246023766]
Large Language Models (LLMs) have the potential to revolutionize data analytics.
We unveil a new data recipe for post-training data synthesis.
We post-train a new foundation model, named CoddLLM, based on Mistral-NeMo-12B.
arXiv Detail & Related papers (2025-02-01T06:03:55Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
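The failure-driven loop behind this idea is easy to sketch: a proposer drafts queries, the target model answers, and only the cases the target gets wrong are kept as training data. All callables below are hypothetical placeholders, not the paper's code:

```python
# Hypothetical sketch of failure-inducing data synthesis in the spirit of
# ReverseGen. propose_queries, target_answer, and is_failure are placeholder
# callables (a proposer LLM, the target model, and an evaluator); none of
# these names come from the paper.
def synthesize_failure_data(propose_queries, target_answer, is_failure,
                            rounds: int = 3, batch_size: int = 100):
    failures = []
    for _ in range(rounds):
        # Condition the proposer on past failures so it keeps probing
        # the target model's weak spots.
        for query in propose_queries(batch_size, seed_failures=failures):
            answer = target_answer(query)
            if is_failure(query, answer):
                failures.append({"query": query, "bad_answer": answer})
    return failures  # failure cases to be relabeled and used as training data
```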
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Training Task Experts through Retrieval Based Distillation [55.46054242512261]
We present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data.
Our method significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
arXiv Detail & Related papers (2024-07-07T18:27:59Z)
- DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
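For a concrete feel of those knobs, here is a toy sketch of the three curation steps named above; the heuristics are placeholders, not DCLM's reference baselines:

```python
# Toy illustration of the curation strategies DCLM benchmarks: exact
# deduplication, quality filtering, and source mixing. The heuristics are
# placeholders, not the DCLM reference pipeline.
import hashlib
import random

def deduplicate(docs):
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def quality_filter(docs, min_words=50):
    return [doc for doc in docs if len(doc.split()) >= min_words]

def mix(sources, weights, total):
    """Sample from each source in proportion to its mixing weight."""
    batch = []
    for name, docs in sources.items():
        k = min(len(docs), int(total * weights[name]))
        batch.extend(random.sample(docs, k))
    random.shuffle(batch)
    return batch
```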
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing [48.07915731998946]
We present a self-synthesis method for generating large-scale alignment data named Magpie.
We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses.
Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct.
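The core observation is that an aligned chat model, shown only the tokens of its chat template up to the start of a user turn, will invent a plausible user instruction; a second call then answers it. A minimal sketch assuming Llama-3's public chat format (decoding settings are illustrative):

```python
# Sketch of Magpie-style self-synthesis: feed Llama-3-Instruct only its
# pre-query chat template so it invents a user instruction, then answer it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Everything up to (and including) the start of the user turn: no content.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
ids = tok(pre_query, return_tensors="pt", add_special_tokens=False).to(lm.device)
out = lm.generate(**ids, max_new_tokens=128, do_sample=True, temperature=1.0)
instruction = tok.decode(out[0][ids.input_ids.shape[1]:], skip_special_tokens=True)

# Second pass: answer the self-generated instruction as a normal chat turn.
chat = [{"role": "user", "content": instruction}]
ids2 = tok.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(lm.device)
out2 = lm.generate(ids2, max_new_tokens=256, do_sample=True, temperature=0.7)
response = tok.decode(out2[0][ids2.shape[1]:], skip_special_tokens=True)
```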
arXiv Detail & Related papers (2024-06-12T17:52:30Z)
- LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
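The loop itself is simple to sketch: fine-tune a student on the seed data, collect the seed examples it still answers incorrectly, and ask the teacher LLM to write new examples in their style. All callables below are hypothetical placeholders, not the paper's code:

```python
# Hypothetical sketch of the LLM2LLM loop: fine-tune a student on seed data,
# find the seed examples it still gets wrong, and have a teacher LLM write
# new variants of exactly those. finetune, evaluate, and teacher_augment are
# placeholders, not the paper's API.
def llm2llm(seed, finetune, evaluate, teacher_augment, iterations: int = 3):
    data = list(seed)
    student = finetune(data)
    for _ in range(iterations):
        wrong = [ex for ex in seed if not evaluate(student, ex)]
        if not wrong:
            break
        data += teacher_augment(wrong)  # teacher writes similar new examples
        student = finetune(data)        # retrain on the enlarged dataset
    return data
```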
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the reasoning skills needed for the intended downstream application.
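Conceptually, once per-example gradients have been computed and randomly projected to a low dimension (the expensive part), selection is a cosine-similarity search against validation gradients. A simplified sketch of that final step, assuming the gradient features are precomputed; the scoring rule here is one simplification of the paper's method:

```python
# Simplified sketch of the LESS selection step, operating on precomputed,
# low-rank-projected per-example gradient features.
import torch

def less_select(train_feats: torch.Tensor,  # [N, d] projected train gradients
                val_feats: torch.Tensor,    # [M, d] projected validation gradients
                fraction: float = 0.05) -> torch.Tensor:
    train_n = torch.nn.functional.normalize(train_feats, dim=1)
    val_n = torch.nn.functional.normalize(val_feats, dim=1)
    # Score each training example by its best cosine match to any
    # validation example, then keep the top 5%.
    scores = (train_n @ val_n.T).max(dim=1).values
    k = max(1, int(fraction * train_feats.shape[0]))
    return scores.topk(k).indices  # indices of selected training examples
```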
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
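Mechanically, the self-play objective resembles DPO: the current model should assign higher likelihood to the ground-truth response than to a response generated by its own previous iterate. A sketch of that pairwise loss, assuming the summed response log-probabilities are precomputed; the exact form is our reading of the paper, not its verbatim code:

```python
# Sketch of a SPIN-style pairwise loss: push the current model to prefer the
# ground-truth response over a response sampled from its previous iterate.
# Inputs are summed response log-probs under the current and previous models.
import torch
import torch.nn.functional as F

def spin_loss(logp_real_cur, logp_real_prev, logp_gen_cur, logp_gen_prev,
              beta: float = 0.1) -> torch.Tensor:
    """Logistic loss on the log-ratio margin, as in DPO-style training."""
    margin = beta * ((logp_real_cur - logp_real_prev)
                     - (logp_gen_cur - logp_gen_prev))
    return -F.logsigmoid(margin).mean()
```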
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- DavIR: Data Selection via Implicit Reward for Large Language Models [62.59514469369608]
DavIR is a model-based data selection method for post-training Large Language Models.
We show that 6% of Alpaca dataset selected with DavIR can steer both the LLaMA and Gemma model family to produce superior performance compared to the same models trained on the full 52K dataset.
arXiv Detail & Related papers (2023-10-16T07:26:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.