Related papers: Training Task Experts through Retrieval Based Distillation

URL: http://arxiv.org/abs/2407.05463v1
Date: Sun, 7 Jul 2024 18:27:59 GMT
Title: Training Task Experts through Retrieval Based Distillation
Authors: Jiaxin Ge, Xueying Jia, Vijay Viswanathan, Hongyin Luo, Graham Neubig
Abstract summary: We present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. Our method significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
Score: 55.46054242512261
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, such datasets often do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the quality of the LLM's output, and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks and results show that our method significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
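
Illustrative sketch (not from the paper): the abstract describes a two-stage flow in which data is first retrieved and then transformed into task-specific examples. The Python below is a toy version of that flow under loud assumptions: the token-overlap retriever, the in-memory `corpus`, and the `template_transform` function are all hypothetical stand-ins. In particular, `template_transform` only marks where an LLM-based transformation step might go. This is not the authors' implementation.

```python
from collections import Counter
from typing import Callable


def overlap_score(query: str, text: str) -> int:
    """Count shared lowercase tokens (a toy stand-in for a real retriever)."""
    q = Counter(query.lower().split())
    t = Counter(text.lower().split())
    return sum((q & t).values())


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus items with the highest token overlap with the query."""
    ranked = sorted(corpus, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def template_transform(task: str, item: str) -> dict[str, str]:
    """Hypothetical placeholder for the transformation step.

    A real system would presumably prompt an LLM here; this function only
    wraps the retrieved item in a fixed template.
    """
    return {"task": task, "source": item, "input": f"Context: {item}"}


def retrieve_then_transform(
    task: str,
    corpus: list[str],
    transform: Callable[[str, str], dict[str, str]],
    k: int = 2,
) -> list[dict[str, str]]:
    """Stage 1: retrieve candidates. Stage 2: transform each into a task example."""
    return [transform(task, item) for item in retrieve(task, corpus, k)]


# Assumed toy corpus, invented for illustration only.
corpus = [
    "a passage about the history of rome with questions",
    "stock prices for march",
    "recipe for tomato soup",
]
task = "answer questions about a passage"
examples = retrieve_then_transform(task, corpus, template_transform, k=2)
```

Because the transformation is passed in as a function, the placeholder can be swapped for a model call without touching the retrieval stage.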

Related papers

DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training [16.441081996257576]
Large language models (LLMs) have recently achieved remarkable performance on various complex reasoning benchmarks. We construct a large-scale, difficulty-graded reasoning dataset containing about 3.34 million unique queries of varying difficulty levels. We significantly improve the reasoning capabilities of the base model, achieving a pass rate of 79.2% on the AIME2024 mathematical reasoning benchmark.
arXiv Detail & Related papers (2025-04-24T13:57:53Z)
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding [61.15402517835137]
We build a supervised fine-tuning (SFT) dataset to achieve state-of-the-art coding capability results in models of various sizes. Our models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning.
arXiv Detail & Related papers (2025-04-02T17:50:31Z)
LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering [1.0874597293913013]
Multiple Choice Question Answering (MCQA) is an important problem with numerous real-world applications, such as medicine, law, and education. We propose a simple yet effective approach that uses Large Language Models for data generation and scoring. Our method improves accuracy from 28.9% to 39.3%, representing a gain of over 10% compared to a baseline finetuned directly on 5-shot examples.
arXiv Detail & Related papers (2024-12-13T02:48:36Z)
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options. Our method is able to work under black-box conditions without access to model training data or weights. We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
MAmmoTH2: Scaling Instructions from the Web [39.786198452175505]
We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus. We build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-06T15:11:38Z)
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process. We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
GOLD: Generalized Knowledge Distillation via Out-of-Distribution-Guided Language Data Generation [21.56082253577229]
GOLD is a task-agnostic data generation and knowledge distillation framework. It employs an iterative out-of-distribution-guided feedback mechanism for the LLM. An energy-based OOD evaluation approach is also introduced to deal with noisy generated data.
arXiv Detail & Related papers (2024-03-28T18:08:22Z)
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs). In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
EntGPT: Linking Generative Large Language Models with Knowledge Bases [9.067856411512427]
The ability of Large Language Models to generate factually correct output remains relatively unexplored. We design a three-step hard-prompting method to probe LLMs' entity disambiguation (ED) performance without supervised fine-tuning. We further improve the knowledge grounding ability through instruction tuning (IT) with similar prompts and responses.
arXiv Detail & Related papers (2024-02-09T19:16:27Z)
Distilling from Similar Tasks for Transfer Learning on a Budget [38.998980344852846]
Transfer learning is an effective solution for training with few labels, but it often comes at the expense of computationally costly fine-tuning of large base models. We propose to mitigate this unpleasant trade-off between compute and accuracy via semi-supervised cross-domain distillation. Our methods need no access to source data, and merely need features and pseudo-labels of the source models.
arXiv Detail & Related papers (2023-04-24T17:59:01Z)
Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve performance of state-of-the-art deep learning models. Specifically, we train auxiliary models which are able to complement state-of-the-art model uncertainty.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.