Related papers: BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

URL: http://arxiv.org/abs/2408.15079v1
Date: Tue, 27 Aug 2024 14:08:23 GMT
Title: BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
Authors: Guosheng Dong, Da Pan, Yiding Sun, Shusen Zhang, Zheng Liang, Xin Wu, Yanjun Shen, Fan Yang, Haoze Sun, Tianpeng Li, Mingan Lin, Jianhua Xu, Yufan Zhang, Xiaonan Nie, Lei Su, Bingning Wang, Wentao Zhang, Jiaxin Mao, Zenan Zhou, Weipeng Chen,
Abstract summary: General capabilities of Large Language Models (LLM) highly rely on extensive pretraining datasets, treated as commercial secrets by several institutions. We open-source the details of a universally applicable data processing pipeline to validate its effectiveness and potential. BaichuanSEED demonstrates consistency and predictability throughout training and achieves comparable performance on comprehensive benchmarks.
Score: 34.518474035662905
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The general capabilities of Large Language Models (LLM) highly rely on the composition and selection on extensive pretraining datasets, treated as commercial secrets by several institutions. To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline. Specifically, the data processing pipeline consists of broad collection to scale up and reweighting to improve quality. We then pretrain a 7B model BaichuanSEED with 3T tokens processed by our pipeline without any deliberate downstream task-related optimization, followed by an easy but effective supervised fine-tuning stage. BaichuanSEED demonstrates consistency and predictability throughout training and achieves comparable performance on comprehensive benchmarks with several commercial advanced large language models, such as Qwen1.5 and Llama3. We also conduct several heuristic experiments to discuss the potential for further optimization of downstream tasks, such as mathematics and coding.

Related papers

InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities [27.09178257629886]
InfiAlign is a scalable and sample-efficient post-training framework for large language models (LLMs)<n>At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning.<n>Our results highlight the effectiveness of combining principled data selection with full-stage post-training.
arXiv Detail & Related papers (2025-08-07T15:34:06Z)
Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance [38.362162910767466]
We conduct the first comprehensive analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk.<n>We derive statistics that reveal structural and qualitative similarities and differences between the two datasets.<n>Our findings offer actionable insights for constructing more effective post-training datasets.
arXiv Detail & Related papers (2025-06-06T20:34:06Z)
Federated Continual Instruction Tuning [39.344583304181135]
Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. We introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our proposed method significantly enhances model performance across varying levels of data and catastrophic forgetting.
arXiv Detail & Related papers (2025-03-17T07:58:06Z)
Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM [45.510445021130685]
Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies.
arXiv Detail & Related papers (2025-03-10T10:52:50Z)
Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications. The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z)
Aligning Instruction Tuning with Pre-training [81.4748965653345]
We propose Aligning Instruction Tuning with Pre-training (AITP) to align instruction tuning with pre-training distributions. We show consistent performance improvements with AITP on three fully open large language models (LLMs) across eight benchmarks.
arXiv Detail & Related papers (2025-01-16T08:27:40Z)
Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets. This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues.
arXiv Detail & Related papers (2025-01-08T11:37:06Z)
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization process to enhance the multimodal reasoning capabilities of MLLMs. We develop a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
Bucket Pre-training is All You Need [9.332544709626875]
Large language models (LLMs) have demonstrated exceptional performance across various natural language processing tasks. The conventional fixed-length data composition strategy for pretraining, which involves concatenating and splitting documents, can introduce noise and limit the model's ability to capture long-range dependencies. We propose a multi-bucket data composition method that moves beyond the fixed-length paradigm, offering a more flexible and efficient approach to pretraining.
arXiv Detail & Related papers (2024-07-10T09:27:23Z)
Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning [64.5243480989869]
Instruction Fine-Tuning (IFT) significantly enhances the zero-shot capabilities of pretrained Large Language Models (LLMs) This paper investigates how coding data impact LLMs' reasoning capacities during the IFT stage.
arXiv Detail & Related papers (2024-05-30T23:20:25Z)
3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset [13.808860456901204]
We introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset known as 3DBench. Specifically, we establish the benchmark that spans a wide range of spatial and semantic scales, from object-level to scene-level. We present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total.
arXiv Detail & Related papers (2024-04-23T02:06:10Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
Zero- and Few-Shots Knowledge Graph Triplet Extraction with Large Language Models [7.919349589245355]
In this work, we tested the Triplet Extraction capabilities of a variety of Large Language Models (LLMs) of different sizes in the Zero- and Few-Shots settings. We proposed a pipeline that dynamically gathers contextual information from a Knowledge Base (KB), both in the form of context triplets and of (sentence, triplets) pairs as examples.
arXiv Detail & Related papers (2023-12-04T15:12:04Z)
FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses these challenges of federated fine-tuning LLMs, and introduces our package FS-LLM as a main contribution. We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training [44.790636524264]
Point Prompt Training is a novel framework for multi-dataset synergistic learning in the context of 3D representation learning. It can overcome the negative transfer associated with synergistic learning and produce generalizable representations. It achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training.
arXiv Detail & Related papers (2023-08-18T17:59:57Z)
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training. We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
Cross-Modal Fine-Tuning: Align then Refine [83.37294254884446]
ORCA is a cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. We show that ORCA obtains state-of-the-art results on 3 benchmarks containing over 60 datasets from 12 modalities.
arXiv Detail & Related papers (2023-02-11T16:32:28Z)
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information [77.80071279597665]
We propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training) Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation.
arXiv Detail & Related papers (2022-11-17T18:59:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.