SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding
- URL: http://arxiv.org/abs/2505.04723v1
- Date: Wed, 07 May 2025 18:21:47 GMT
- Title: SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding
- Authors: Jingyang Deng, Ran Chen, Jo-Ku Cheng, Jinwen Ma
- Abstract summary: This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs). Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.
- Score: 10.38247266103905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. In this work, we propose SOAEsV2-7B/72B, a specialized LLM series developed via a three-phase framework: 1) continual pre-training integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT employs a curriculum-based learning strategy, transitioning from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize domain-specific tasks; 3) distillation-enhanced speculative decoding accelerates inference via logit distillation between 72B target and 7B draft models, achieving a 1.39-1.52$\times$ speedup without quality loss. Experimental results demonstrate that our domain-specific pre-training phase maintains 99.8% of original general language capabilities while significantly improving domain performance, resulting in a 1.08$\times$ improvement in Rouge-1 score and a 1.17$\times$ enhancement in BLEU-4 score. Ablation studies further show that domain-progressive SFT outperforms single-stage training, achieving a 1.02$\times$ improvement in Rouge-1 and a 1.06$\times$ improvement in BLEU-4. Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.
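The third phase is the one most amenable to a concrete illustration. Below is a minimal PyTorch sketch of a logit-distillation objective of the kind the abstract describes for aligning the 7B draft with the 72B target; the function name, temperature, and toy tensors are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): a logit-distillation loss for training a
# small draft LM to match a large, frozen target LM before speculative decoding.
# In practice both logit tensors would come from causal LMs that share one
# tokenizer (e.g. a 7B draft and a 72B target) on the same token sequence.
import torch
import torch.nn.functional as F


def logit_distillation_loss(draft_logits: torch.Tensor,
                            target_logits: torch.Tensor,
                            temperature: float = 2.0) -> torch.Tensor:
    """KL(target || draft) over the shared vocabulary, averaged over tokens."""
    vocab = draft_logits.size(-1)
    draft_logp = F.log_softmax(draft_logits / temperature, dim=-1).view(-1, vocab)
    target_p = F.softmax(target_logits / temperature, dim=-1).view(-1, vocab)
    # batchmean gives a per-token average; the T^2 factor is standard distillation
    # practice so gradient scale stays comparable across temperatures.
    return F.kl_div(draft_logp, target_p, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    batch, seq, vocab = 2, 16, 32000
    target_logits = torch.randn(batch, seq, vocab)                     # frozen target (no grad)
    draft_logits = torch.randn(batch, seq, vocab, requires_grad=True)  # trainable draft
    loss = logit_distillation_loss(draft_logits, target_logits)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

At inference time, the distilled draft proposes a short block of tokens that the target model verifies in a single forward pass; tighter logit agreement raises the acceptance rate, which is what the reported 1.39-1.52$\times$ speedup relies on.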
Related papers
- CLGRPO: Reasoning Ability Enhancement for Small VLMs [4.551310348498266]
Small Vision Language Models (SVLMs) generally refer to models with parameter sizes less than or equal to 2B. This paper proposes a post-training optimization paradigm called the Incremental Training Strategy to enhance the reasoning ability of SVLMs. Experimental results demonstrate that our method significantly improves the reasoning ability of a 1B SVLM.
arXiv Detail & Related papers (2025-06-22T14:32:15Z) - MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities [0.0]
We propose a novel framework, Mixture of Losses (MoL), which decouples optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to the domain corpus to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities, as sketched below.
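A rough PyTorch sketch of a dual-loss objective of this shape follows; the function name, fixed weighting, and tensor layout are illustrative assumptions rather than the MoL authors' code.

```python
# Hedged sketch of a Mixture-of-Losses style objective as summarized above:
# cross-entropy on domain-corpus tokens plus a KL term that keeps general-corpus
# predictions close to a frozen copy of the base model.
import torch
import torch.nn.functional as F


def mol_style_loss(domain_logits: torch.Tensor, domain_labels: torch.Tensor,
                   general_logits: torch.Tensor, base_general_logits: torch.Tensor,
                   kl_weight: float = 1.0) -> torch.Tensor:
    vocab = domain_logits.size(-1)
    # 1) Knowledge acquisition: next-token cross-entropy on the domain corpus.
    ce = F.cross_entropy(domain_logits.view(-1, vocab),
                         domain_labels.view(-1), ignore_index=-100)
    # 2) Capability retention: KL between the frozen base model's distribution
    #    and the fine-tuned model's distribution on the general corpus.
    logp = F.log_softmax(general_logits, dim=-1).view(-1, vocab)
    base_p = F.softmax(base_general_logits, dim=-1).view(-1, vocab)
    kl = F.kl_div(logp, base_p, reduction="batchmean")
    return ce + kl_weight * kl
```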
arXiv Detail & Related papers (2025-05-17T15:12:47Z) - Can LLMs handle WebShell detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework [11.613261852608062]
WebShell attacks, in which malicious scripts are injected into web servers, are a major cybersecurity threat. This work is the first to explore the feasibility and limitations of Large Language Models for WebShell detection.
arXiv Detail & Related papers (2025-04-14T21:09:37Z) - FAIT: Fault-Aware Fine-Tuning for Better Code Generation [11.8755180563981]
We propose Fault-Aware Fine-Tuning (FAIT) to enhance instruction-tuned large language models' code generation. Our method achieves an average relative improvement of 6.9% on pass@1 with just one epoch of training, with some enhanced 6.7B LLMs outperforming closed-source models such as GPT-3.5-Turbo.
arXiv Detail & Related papers (2025-03-21T07:23:26Z) - Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models? [0.0]
The Generalized Edge Model (GEM) aims to balance robustness and generalization in a harmonious manner. GEM employs a Sparse Cross-Attention Router (SCAR) to dynamically allocate computation across a variable number of computing resources. Compared to GPT-4 Lite, GEM improves general-task performance by 7% while maintaining parity in domain-specific performance.
arXiv Detail & Related papers (2025-03-16T18:30:26Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities.
We establish a benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets.
We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z) - Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance -- A Case Study in Finance [0.32985979395737774]
We present a detailed analysis of fine-tuning large language models (LLMs) for domain-specific tasks. We find that in domain-specific cases, fine-tuning exclusively on the target task is not always the most effective strategy. We demonstrate how this approach enables a small model, such as Phi-3-Mini, to achieve state-of-the-art results.
arXiv Detail & Related papers (2024-10-01T22:35:56Z) - On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability [59.72892401927283]
We evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks.
Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints.
arXiv Detail & Related papers (2024-09-30T03:58:43Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models [56.89958793648104]
Large Language Models (LLMs) are versatile and capable of addressing a diverse range of tasks.
Previous approaches either conduct continuous pre-training with domain-specific data or employ retrieval augmentation to support general LLMs.
We present a novel framework named BLADE, which enhances Black-box LArge language models with small Domain-spEcific models.
arXiv Detail & Related papers (2024-03-27T08:57:21Z) - DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language model (LLM) agents and case-based reasoning (CBR).
In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle.
In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z) - Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning [50.9692060692705]
This paper introduces $\textbf{La}$nguage Models for $\textbf{Mo}$tion Control ($\textbf{LaMo}$), a general framework based on Decision Transformers for offline RL. The framework highlights four crucial components, including (1) initializing Decision Transformers with sequentially pre-trained LMs and (2) employing the LoRA fine-tuning method. In particular, the method demonstrates superior performance in scenarios with limited data samples.
arXiv Detail & Related papers (2023-10-31T16:24:17Z)