SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding
- URL: http://arxiv.org/abs/2505.04723v1
- Date: Wed, 07 May 2025 18:21:47 GMT
- Title: SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding
- Authors: Jingyang Deng, Ran Chen, Jo-Ku Cheng, Jinwen Ma
- Abstract summary: This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs). Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.
- Score: 10.38247266103905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. In this work, we propose SOAEsV2-7B/72B, a specialized LLM series developed via a three-phase framework: 1) continual pre-training integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT employs a curriculum-based learning strategy, transitioning from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize domain-specific tasks; 3) distillation-enhanced speculative decoding accelerates inference via logit distillation between 72B target and 7B draft models, achieving a 1.39-1.52$\times$ speedup without quality loss. Experimental results demonstrate that our domain-specific pre-training phase maintains 99.8% of original general language capabilities while significantly improving domain performance, resulting in a 1.08$\times$ improvement in Rouge-1 score and a 1.17$\times$ enhancement in BLEU-4 score. Ablation studies further show that domain-progressive SFT outperforms single-stage training, achieving a 1.02$\times$ improvement in Rouge-1 and a 1.06$\times$ improvement in BLEU-4. Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.
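The third phase is the one most amenable to a concrete illustration. Below is a minimal PyTorch sketch of a logit-distillation objective of the kind the abstract describes for aligning the 7B draft with the 72B target; the function name, temperature, and toy tensors are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): a logit-distillation loss for training a
# small draft LM to match a large, frozen target LM before speculative decoding.
# In practice both logit tensors would come from causal LMs that share one
# tokenizer (e.g. a 7B draft and a 72B target) on the same token sequence.
import torch
import torch.nn.functional as F


def logit_distillation_loss(draft_logits: torch.Tensor,
                            target_logits: torch.Tensor,
                            temperature: float = 2.0) -> torch.Tensor:
    """KL(target || draft) over the shared vocabulary, averaged over tokens."""
    vocab = draft_logits.size(-1)
    draft_logp = F.log_softmax(draft_logits / temperature, dim=-1).view(-1, vocab)
    target_p = F.softmax(target_logits / temperature, dim=-1).view(-1, vocab)
    # batchmean gives a per-token average; the T^2 factor is standard distillation
    # practice so gradient scale stays comparable across temperatures.
    return F.kl_div(draft_logp, target_p, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    batch, seq, vocab = 2, 16, 32000
    target_logits = torch.randn(batch, seq, vocab)                     # frozen target (no grad)
    draft_logits = torch.randn(batch, seq, vocab, requires_grad=True)  # trainable draft
    loss = logit_distillation_loss(draft_logits, target_logits)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

At inference time, the distilled draft proposes a short block of tokens that the target model verifies in a single forward pass; tighter logit agreement raises the acceptance rate, which is what the reported 1.39-1.52$\times$ speedup relies on.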
Related papers
- CLGRPO: Reasoning Ability Enhancement for Small VLMs [4.551310348498266]
Small Vision Language Models (SVLMs) generally refer to models with parameter sizes less than or equal to 2B. This paper proposes a post-training optimization paradigm called the Incremental Training Strategy to enhance the reasoning ability of SVLMs. Experimental results demonstrate that our method significantly improves the reasoning ability of a 1B SVLM.
arXiv Detail & Related papers (2025-06-22T14:32:15Z) - MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities [0.0]
We propose a novel framework, Mixture of Losses (MoL), which decouples optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to the domain corpus to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities, as sketched below.
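A rough PyTorch sketch of a dual-loss objective of this shape follows; the function name, fixed weighting, and tensor layout are illustrative assumptions rather than the MoL authors' code.

```python
# Hedged sketch of a Mixture-of-Losses style objective as summarized above:
# cross-entropy on domain-corpus tokens plus a KL term that keeps general-corpus
# predictions close to a frozen copy of the base model.
import torch
import torch.nn.functional as F


def mol_style_loss(domain_logits: torch.Tensor, domain_labels: torch.Tensor,
                   general_logits: torch.Tensor, base_general_logits: torch.Tensor,
                   kl_weight: float = 1.0) -> torch.Tensor:
    vocab = domain_logits.size(-1)
    # 1) Knowledge acquisition: next-token cross-entropy on the domain corpus.
    ce = F.cross_entropy(domain_logits.view(-1, vocab),
                         domain_labels.view(-1), ignore_index=-100)
    # 2) Capability retention: KL between the frozen base model's distribution
    #    and the fine-tuned model's distribution on the general corpus.
    logp = F.log_softmax(general_logits, dim=-1).view(-1, vocab)
    base_p = F.softmax(base_general_logits, dim=-1).view(-1, vocab)
    kl = F.kl_div(logp, base_p, reduction="batchmean")
    return ce + kl_weight * kl
```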
arXiv Detail & Related papers (2025-05-17T15:12:47Z) - Can LLMs handle WebShell detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework [11.613261852608062]
WebShell attacks, in which malicious scripts are injected into web servers, are a major cybersecurity threat. This work is the first to explore the feasibility and limitations of Large Language Models for WebShell detection.
arXiv Detail & Related papers (2025-04-14T21:09:37Z) - FAIT: Fault-Aware Fine-Tuning for Better Code Generation [11.8755180563981]
We propose Fault-Aware Fine-Tuning (FAIT) to enhance instruction-tuned large language models' code generation. Our method achieves an average relative improvement of 6.9% on pass@1 with just one epoch of training, with some enhanced 6.7B LLMs outperforming closed-source models such as GPT-3.5-Turbo.
arXiv Detail & Related papers (2025-03-21T07:23:26Z) - Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models? [0.0]
The Generalized Edge Model (GEM) aims to balance robustness and generalization in a harmonious manner. GEM employs a Sparse Cross-Attention Router (SCAR) to dynamically allocate computation across a variable number of computing resources. Compared to GPT-4 Lite, GEM improves general-task performance by 7% while maintaining parity in domain-specific performance.
arXiv Detail & Related papers (2025-03-16T18:30:26Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities.
We establish a benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets.
We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z) - Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance -- A Case Study in Finance [0.32985979395737774]
We present a detailed analysis of fine-tuning large language models (LLMs) for domain-specific tasks. We find that in domain-specific cases, fine-tuning exclusively on the target task is not always the most effective strategy. We demonstrate how this approach enables a small model, such as Phi-3-Mini, to achieve state-of-the-art results.
arXiv Detail & Related papers (2024-10-01T22:35:56Z) - On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability [59.72892401927283]
We evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks.
Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints.
arXiv Detail & Related papers (2024-09-30T03:58:43Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models [56.89958793648104]
Large Language Models (LLMs) are versatile and capable of addressing a diverse range of tasks.
Previous approaches either conduct continuous pre-training with domain-specific data or employ retrieval augmentation to support general LLMs.
We present a novel framework named BLADE, which enhances Black-box LArge language models with small Domain-spEcific models.
arXiv Detail & Related papers (2024-03-27T08:57:21Z) - DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language model (LLM) agents and case-based reasoning (CBR).
In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle.
In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z) - Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning [50.9692060692705]
This paper introduces $\textbf{La}$nguage Models for $\textbf{Mo}$tion Control ($\textbf{LaMo}$), a general framework based on Decision Transformers for offline RL. The framework highlights four crucial components, including (1) initializing Decision Transformers with sequentially pre-trained LMs and (2) employing the LoRA fine-tuning method. In particular, the method demonstrates superior performance in scenarios with limited data samples.
arXiv Detail & Related papers (2023-10-31T16:24:17Z)