Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
- URL: http://arxiv.org/abs/2410.10210v2
- Date: Tue, 15 Oct 2024 13:21:19 GMT
- Title: Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
- Authors: Yingda Chen, Xingjun Wang, Jintao Huang, Yunlin Mao, Daoze Zhang, Yuze Zhao
- Abstract summary: We show that it is possible to achieve notable performance improvement in tuned models with a small fraction of training data instances and compute.
Our findings suggest that, while out-of-the-box capacity for generating long output varies across models, tuning them with high-quality data and light compute consistently yields notable improvement across all the models we experimented on.
- Score: 3.3339400603549265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models rapidly evolve to support longer context, there is a notable disparity in their capability to generate output at greater lengths. A recent study suggests that the primary cause of this imbalance may be the lack of long-output data during alignment training. In light of this observation, attempts have been made to re-align foundation models with data that fills the gap, resulting in models capable of generating lengthy output when instructed. In this paper, we explore the impact of data quality in tuning a model for long output, and the possibility of doing so starting from human-aligned (instruct or chat) models. With careful data curation, we show that it is possible to achieve similar performance improvement in our tuned models with only a small fraction of the training data instances and compute. In addition, we assess the generalizability of such approaches by applying our tuning recipes to several models. Our findings suggest that, while out-of-the-box capacity for generating long output varies across models, tuning them with high-quality data and light compute consistently yields notable improvement across all the models we experimented on. We have made public our curated dataset for tuning long-writing capability, the implementations of model tuning and evaluation, and the fine-tuned models, all of which are openly accessible.
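Below is a minimal sketch, for illustration only, of what the recipe described in the abstract amounts to: light supervised fine-tuning of an already human-aligned chat model on a small, curated set of long-output examples. The model checkpoint, example data, sequence length, and hyperparameters are assumptions made for this sketch, not the authors' publicly released implementation.

```python
# Illustrative sketch only: light supervised fine-tuning (SFT) of an
# already-aligned chat model on a small curated set of long-output
# examples. Model name, data, and hyperparameters are placeholders.
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Qwen/Qwen2-7B-Instruct"  # hypothetical starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# A few curated (instruction, long response) pairs stand in for a real
# curated dataset; real examples would carry multi-thousand-token outputs.
examples = [
    {"prompt": "Write a 6000-word report on deep-sea ecosystems.",
     "response": "Deep-sea ecosystems are ..."},  # long text elided
]

def to_features(example):
    # Concatenate prompt and response into one causal-LM training sequence.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=8192)

train_ds = Dataset.from_list(examples).map(
    to_features, remove_columns=["prompt", "response"]
)

# mlm=False makes the collator pad batches and copy input_ids into labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="long-output-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=10,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=collator,
).train()
```

The sketch mirrors the abstract's emphasis: a handful of genuinely long, high-quality responses and modest compute are assumed, rather than a full re-alignment run on large-scale data.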
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a novel sandbox suite tailored for integrated data-model co-development.
This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models.
We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior.
arXiv Detail & Related papers (2024-07-16T14:40:07Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- Towards Continually Learning Application Performance Models [1.2278517240988065]
Machine learning-based performance models are increasingly being used to make critical job scheduling and application optimization decisions.
Traditionally, these models assume that data distribution does not change as more samples are collected over time.
We develop continually learning performance models that account for the distribution drift, alleviate catastrophic forgetting, and improve generalizability.
arXiv Detail & Related papers (2023-10-25T20:48:46Z)
- Farzi Data: Autoregressive Data Distillation [34.39112473620335]
We study data distillation for auto-regressive machine learning tasks.
We propose Farzi, which summarizes an event sequence dataset into a small number of synthetic sequences.
arXiv Detail & Related papers (2023-10-15T23:23:27Z)
- Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting [54.04430089029033]
We present Lag-Llama, a general-purpose foundation model for time series forecasting based on a decoder-only transformer architecture.
Lag-Llama is pretrained on a large corpus of diverse time series data from several domains, and demonstrates strong zero-shot generalization capabilities.
When fine-tuned on relatively small fractions of such previously unseen datasets, Lag-Llama achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-10-12T12:29:32Z)
- Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune an LLM that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- GradTail: Learning Long-Tailed Data Using Gradient-based Sample Weighting [15.418627530276598]
We show that an approach based on gradient dot product agreement can isolate long-tailed data early on during model training and improve performance by dynamically picking higher sample weights for that data.
We show that such upweighting leads to model improvements for both classification and regression models, the latter of which are relatively unexplored in the long-tail literature.
arXiv Detail & Related papers (2022-01-16T00:37:39Z)
- Generative time series models using Neural ODE in Variational Autoencoders [0.0]
We implement Neural Ordinary Differential Equations in a Variational Autoencoder setting for generative time series modeling.
An object-oriented approach to the code was taken to allow for easier development and research.
arXiv Detail & Related papers (2022-01-12T14:38:11Z)
- Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several datasets, using a variety of state-of-the-art benchmarks, demonstrates that our approach achieves significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z)