MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities
- URL: http://arxiv.org/abs/2505.12043v2
- Date: Tue, 20 May 2025 02:37:36 GMT
- Title: MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities
- Authors: Jingxue Chen, Qingkun Tang, Qianchun Lu, Siyuan Fang
- Abstract summary: We propose a novel framework, Mixture of Losses (MoL), which decouples optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to the domain corpus to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large language models (LLMs) perform well in general tasks, domain-specific applications suffer from hallucinations and accuracy limitations. Continual Pre-Training (CPT) approaches encounter two key issues: (1) domain-biased data degrades general language skills, and (2) improper corpus-mixture ratios limit effective adaptation. To address these, we propose a novel framework, Mixture of Losses (MoL), which decouples optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to the domain corpus to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities. This dual-loss architecture preserves universal skills while enhancing domain expertise, avoiding catastrophic forgetting. Empirically, we validate that a 1:1 domain-to-general corpus ratio optimally balances training and overfitting without the need for extensive tuning or resource-intensive experiments. Furthermore, our experiments demonstrate significant performance gains compared to traditional CPT approaches, which often suffer from degradation in general language capabilities; our model achieves 27.9% higher accuracy on the Math-500 benchmark in the non-think reasoning mode, and an impressive 83.3% improvement on the challenging AIME25 subset in the think mode, underscoring the effectiveness of our approach.
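The dual-loss setup described above maps naturally onto a single training step that mixes two mini-batches. The PyTorch sketch below is a minimal illustration of that idea, assuming Hugging-Face-style models that return `.logits` and a frozen copy of the base model as the KL reference; the function and variable names are ours, not the authors'.

```python
import torch
import torch.nn.functional as F

def mol_step(model, base_model, domain_batch, general_batch):
    """One MoL-style step (illustrative sketch, not the authors' code):
    CE on the domain batch for knowledge acquisition, KL against a frozen
    base model on the general batch to preserve general capabilities."""
    # Cross-entropy next-token loss on the domain-specific token batch (B, T).
    d_logits = model(domain_batch).logits            # (B, T, V)
    ce_loss = F.cross_entropy(
        d_logits[:, :-1].reshape(-1, d_logits.size(-1)),
        domain_batch[:, 1:].reshape(-1),
    )

    # KL divergence to the frozen base model on the general-corpus batch.
    g_logits = model(general_batch).logits
    with torch.no_grad():
        ref_logits = base_model(general_batch).logits
    kl_loss = F.kl_div(
        F.log_softmax(g_logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return ce_loss + kl_loss
```

With the 1:1 domain-to-general ratio reported in the abstract, the two mini-batches would simply be sampled in equal proportion; any per-loss weighting would be an additional hyperparameter.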
Related papers
- NDCG-Consistent Softmax Approximation with Accelerated Convergence [67.10365329542365]
We propose novel loss formulations that align directly with ranking metrics. We integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method. Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance.
arXiv Detail & Related papers (2025-06-11T06:59:17Z) - Dual Decomposition of Weights and Singular Value Low Rank Adaptation [9.048461365342204]
We propose DuDe, a novel approach that decomposes weight matrices into magnitude and direction components. Our evaluation demonstrates DuDe's superior performance and robustness, achieving up to 48.35% accuracy on MMLU and 62.53% (±1.59) accuracy on GSM8K.
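As a point of reference for the magnitude/direction decomposition mentioned above, the short sketch below splits a weight matrix into per-column norms and unit-norm directions; the column-wise convention is our assumption, and DuDe's low-rank adaptation machinery is omitted.

```python
import torch

def decompose_weight(W: torch.Tensor):
    """Illustrative magnitude/direction split of a weight matrix.
    Column-wise norms are an assumption, not a detail from the paper."""
    magnitude = W.norm(dim=0, keepdim=True)   # per-column scale, shape (1, in)
    direction = W / magnitude                 # unit-norm columns
    return magnitude, direction

W = torch.randn(16, 8)
m, d = decompose_weight(W)
assert torch.allclose(m * d, W, atol=1e-6)    # recombination recovers W
```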
arXiv Detail & Related papers (2025-05-20T13:49:15Z) - Unified Enhancement of the Generalization and Robustness of Language Models via Bi-Stage Optimization [2.502393972789905]
We propose a bi-stage optimization framework to uniformly enhance both the generalization and robustness of LMs. We show that our method significantly improves the generalization and robustness of LMs compared to other existing methods.
arXiv Detail & Related papers (2025-03-19T13:50:36Z) - How to Alleviate Catastrophic Forgetting in LLMs Finetuning? Hierarchical Layer-Wise and Element-Wise Regularization [15.434072331989878]
Large Language Models (LLMs) exhibit strong general language capabilities. Fine-tuning these models on domain-specific tasks often leads to catastrophic forgetting, where the model overwrites or loses essential knowledge acquired during pretraining. We propose a novel approach to compute the element-wise importance of model parameters crucial for preserving general knowledge during fine-tuning.
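The summary above does not spell out the exact formulation, so the sketch below shows a generic element-wise importance-weighted regularizer (EWC-style), purely to illustrate the kind of penalty such importance scores enable; it is not the paper's method.

```python
def importance_penalty(model, ref_params, importance, lam=1.0):
    """Generic element-wise importance regularizer (illustrative only).
    Penalizes each parameter's deviation from its pretrained value,
    weighted by a per-element importance estimate."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in importance:
            penalty = penalty + (importance[name] * (p - ref_params[name]) ** 2).sum()
    return lam * penalty
```

During fine-tuning such a penalty would simply be added to the task loss; how the importance scores themselves are computed is exactly what the paper proposes and is not reproduced here.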
arXiv Detail & Related papers (2025-01-23T13:54:53Z) - Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets. This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues.
arXiv Detail & Related papers (2025-01-08T11:37:06Z) - SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning [6.262268096839562]
Domain generalization aims to adapt a model using one or multiple source domains to ensure robust performance in unseen target domains. Existing parameter-efficient fine-tuning (PEFT) methods struggle to strike a balance between preserving generalizable components of the pre-trained model and learning task-specific features. We introduce Singular Value Decomposed Minor Components Adaptation (SoMA), an approach that selectively tunes minor singular components while keeping the residual parts frozen.
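To make the "minor singular components" idea concrete, the sketch below splits a pretrained weight via SVD into a frozen principal part and a trainable minor part; the rank split and the full-matrix parameterization are our simplifications, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class MinorComponentLinear(nn.Module):
    """Illustrative sketch: freeze the top-r singular components of a
    pretrained weight and fine-tune only the minor (remaining) part."""

    def __init__(self, pretrained_weight: torch.Tensor, r: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        principal = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r]   # frozen
        minor = U[:, r:] @ torch.diag(S[r:]) @ Vh[r:]       # trainable
        self.register_buffer("principal", principal)
        self.minor = nn.Parameter(minor)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.principal + self.minor).T
```

Only `self.minor` receives gradients, so the principal subspace carrying generalizable knowledge stays fixed.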
arXiv Detail & Related papers (2024-12-05T11:17:57Z) - CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models [9.661578977988743]
Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpus.
The data mixture ratio of general corpus and domain-specific corpus, however, has typically been chosen heuristically, leading to sub-optimal training efficiency in practice.
We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data.
arXiv Detail & Related papers (2024-07-24T17:59:02Z) - Semi-Federated Learning: Convergence Analysis and Optimization of A Hybrid Learning Framework [70.83511997272457]
We propose a semi-federated learning (SemiFL) paradigm to leverage both the base station (BS) and devices for a hybrid implementation of centralized learning (CL) and FL.
We propose a two-stage algorithm to solve this intractable problem, providing closed-form solutions for the beamformers.
arXiv Detail & Related papers (2023-10-04T03:32:39Z) - Domain Adaptation with Adversarial Training on Penultimate Activations [82.9977759320565]
Enhancing model prediction confidence on unlabeled target data is an important objective in Unsupervised Domain Adaptation (UDA).
We show that this strategy is more efficient and better correlated with the objective of boosting prediction confidence than adversarial training on input images or intermediate features.
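One plausible reading of this strategy, purely as an illustration: perturb the penultimate features in the direction that most reduces prediction confidence, then train the classifier head to remain confident on the perturbed features. The sketch below follows that reading; it is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def confidence_adversarial_loss(features, classifier, eps=1.0):
    """Hypothetical sketch of adversarial training on penultimate
    activations: an entropy-increasing perturbation is applied to the
    features, and the returned loss encourages confident predictions
    on the perturbed features."""
    feats = features.detach().requires_grad_(True)
    probs = F.softmax(classifier(feats), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    grad = torch.autograd.grad(entropy, feats)[0]
    adv_feats = (feats + eps * F.normalize(grad, dim=-1)).detach()
    adv_probs = F.softmax(classifier(adv_feats), dim=-1)
    # Entropy on perturbed features; minimizing it boosts confidence.
    return -(adv_probs * adv_probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
```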
arXiv Detail & Related papers (2022-08-26T19:50:46Z) - Optimizing Two-way Partial AUC with an End-to-end Framework [154.47590401735323]
Area Under the ROC Curve (AUC) is a crucial metric for machine learning.
Recent work shows that the two-way partial AUC (TPAUC) is essentially inconsistent with existing partial AUC metrics.
In this paper, we present the first attempt to optimize this new metric.
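For context, the TPAUC restricts attention to the region of the ROC curve with high true-positive rate and low false-positive rate. A common formulation (our notation, normalization omitted) is shown below, where ROC(u) is the true-positive rate at false-positive rate u:

```latex
\mathrm{TPAUC}(\alpha, \beta)
  = \int_{0}^{\beta} \bigl[\mathrm{ROC}(u) - \alpha\bigr]_{+}\, \mathrm{d}u,
\qquad [x]_{+} := \max(x, 0),
```

i.e. only the part of the curve with TPR ≥ α and FPR ≤ β contributes, which is why it behaves differently from one-way partial AUC metrics.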
arXiv Detail & Related papers (2022-06-23T12:21:30Z) - Disentangled Federated Learning for Tackling Attributes Skew via Invariant Aggregation and Diversity Transferring [104.19414150171472]
Attributes skew prevents current federated learning (FL) frameworks from maintaining consistent optimization directions among the clients.
We propose disentangled federated learning (DFL) to disentangle the domain-specific and cross-invariant attributes into two complementary branches.
Experiments verify that DFL achieves higher performance, better interpretability, and a faster convergence rate than SOTA FL methods.
arXiv Detail & Related papers (2022-06-14T13:12:12Z) - Latent-Optimized Adversarial Neural Transfer for Sarcasm Detection [50.29565896287595]
We apply transfer learning to exploit common datasets for sarcasm detection.
We propose a generalized latent optimization strategy that allows different losses to accommodate each other.
In particular, we achieve a 10.02% absolute performance gain over the previous state of the art on the iSarcasm dataset.
arXiv Detail & Related papers (2021-04-19T13:07:52Z) - Towards Fair Knowledge Transfer for Imbalanced Domain Adaptation [61.317911756566126]
We propose a fair knowledge transfer framework to handle the fairness challenge in imbalanced cross-domain learning.
Specifically, a novel cross-domain mixup generation is exploited to augment the minority source set with target information to enhance fairness.
Our model significantly improves overall accuracy, by more than 20%, on two benchmarks.
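As a rough illustration of cross-domain mixup (the sampling of the mixing coefficient and the label handling here are our assumptions, not the paper's exact scheme):

```python
import torch

def cross_domain_mixup(x_src, y_src, x_tgt, alpha=0.2):
    """Hypothetical sketch: blend minority-class source samples with
    target-domain samples to augment the source set. Assumes the two
    batches have the same shape."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x_mix = lam * x_src + (1 - lam) * x_tgt
    return x_mix, y_src, lam   # keep source labels, record the mixing weight
```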
arXiv Detail & Related papers (2020-10-23T06:29:09Z) - Learning to Learn Single Domain Generalization [18.72451358284104]
We propose a new method named adversarial domain augmentation to solve this Out-of-Distribution (OOD) generalization problem.
The key idea is to leverage adversarial training to create "fictitious" yet "challenging" populations.
To facilitate fast and desirable domain augmentation, we cast the model training in a meta-learning scheme and use a Wasserstein Auto-Encoder (WAE) to relax the widely used worst-case constraint.
arXiv Detail & Related papers (2020-03-30T04:39:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.