Related papers: FineScope : Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation

FineScope : Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation

URL: http://arxiv.org/abs/2505.00624v1
Date: Thu, 01 May 2025 16:05:08 GMT
Title: FineScope : Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation
Authors: Chaitali Bhattacharyya, Yeseong Kim,
Abstract summary: FineScope is a framework for deriving domain-optimized language models from larger pretrained models.<n>We apply structured pruning with domain-specific constraints, ensuring that the resulting models retain essential knowledge for the target domain.<n>Experiments and ablation studies demonstrate that FineScope achieves highly competitive performance.
Score: 1.8816124486165122
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training large language models (LLMs) from scratch requires significant computational resources, driving interest in developing smaller, domain-specific LLMs that maintain both efficiency and strong task performance. Medium-sized models such as LLaMA, llama} have served as starting points for domain-specific adaptation, but they often suffer from accuracy degradation when tested on specialized datasets. We introduce FineScope, a framework for deriving compact, domain-optimized LLMs from larger pretrained models. FineScope leverages the Sparse Autoencoder (SAE) framework, inspired by its ability to produce interpretable feature representations, to extract domain-specific subsets from large datasets. We apply structured pruning with domain-specific constraints, ensuring that the resulting pruned models retain essential knowledge for the target domain. To further enhance performance, these pruned models undergo self-data distillation, leveraging SAE-curated datasets to restore key domain-specific information lost during pruning. Extensive experiments and ablation studies demonstrate that FineScope achieves highly competitive performance, outperforming several large-scale state-of-the-art LLMs in domain-specific tasks. Additionally, our results show that FineScope enables pruned models to regain a substantial portion of their original performance when fine-tuned with SAE-curated datasets. Furthermore, applying these datasets to fine-tune pretrained LLMs without pruning also improves their domain-specific accuracy, highlighting the robustness of our approach. The code will be released.

Related papers

Building Domain-Specific Small Language Models via Guided Data Generation [5.404790079646315]
Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks.<n>In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges.<n>Many open-source models demand significant computational resources for effective domain adaptation and deployment.<n>We present a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data.
arXiv Detail & Related papers (2025-11-23T07:19:31Z)
Learn More, Forget Less: A Gradient-Aware Data Selection Approach for LLM [51.21051698747157]
We propose a self-adaptive gradient-aware data selection approach (GrADS) for supervised fine-tuning of large language models (LLMs)<n>Specifically, we design self-guided criteria that leverage the magnitude and statistical distribution of gradients to prioritize examples that contribute the most to the model's learning process.<n>Through extensive experimentation with various LLMs across diverse domains such as medicine, law, and finance, GrADS has demonstrated significant efficiency and cost-effectiveness.
arXiv Detail & Related papers (2025-11-07T08:34:50Z)
DIDS: Domain Impact-aware Data Sampling for Large Language Model Training [61.10643823069603]
We present Domain Impact-aware Data Sampling (DIDS) for large language models.<n>DIDS group training data based on learning effects, where a proxy language model and dimensionality reduction are employed.<n>It achieves 3.4% higher average performance while maintaining comparable training efficiency.
arXiv Detail & Related papers (2025-04-17T13:09:38Z)
Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them [9.952432291248954]
We investigate the use of LLM-generated data for continual pretraining of encoder models in domains with limited data.<n>We compile a benchmark specifically designed for assessing embedding model performance in invasion biology.<n>Our results demonstrate that this approach achieves a fully automated pipeline for enhancing domain-specific understanding of small encoder models.
arXiv Detail & Related papers (2025-03-27T21:51:24Z)
LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression.<n>LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model.<n>Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [61.99353167168545]
We show that fine-tuning with LLM-generated data improves target task performance and reduces non-target task degradation.<n>This is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z)
Deriving Coding-Specific Sub-Models from LLMs using Resource-Efficient Pruning [4.762390044282733]
Large Language Models (LLMs) have demonstrated their exceptional performance in various complex code generation tasks.<n>To mitigate such requirements, model pruning techniques are used to create more compact models with significantly fewer parameters.<n>In this work, we explore the idea of efficiently deriving coding-specific sub-models through unstructured pruning.
arXiv Detail & Related papers (2025-01-09T14:00:01Z)
On Domain-Adaptive Post-Training for Multimodal Large Language Models [72.67107077850939]
This paper systematically investigates domain adaptation of MLLMs via post-training.<n>We focus on data synthesis, training pipeline, and task evaluation.<n>We conduct experiments in high-impact domains such as biomedicine, food, and remote sensing.
arXiv Detail & Related papers (2024-11-29T18:42:28Z)
Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.<n>We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z)
Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models [0.8399688944263842]
Large Language Models (LLMs) have the capability to understand and generate human-like text from input queries. This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines. We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding.
arXiv Detail & Related papers (2024-06-17T04:35:17Z)
Pruning as a Domain-specific LLM Extractor [44.81262364608468]
Large Language Models (LLMs) have exhibited remarkable proficiency across a wide array of NLP tasks. Few efforts have explored model pruning techniques to reduce the size of LLMs. This work introduces an innovative unstructured dual-pruning methodology, D-Pruner, for domain-specific compression on LLM.
arXiv Detail & Related papers (2024-05-10T07:05:02Z)
BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models [56.89958793648104]
Large Language Models (LLMs) are versatile and capable of addressing a diverse range of tasks. Previous approaches either conduct continuous pre-training with domain-specific data or employ retrieval augmentation to support general LLMs. We present a novel framework named BLADE, which enhances Black-box LArge language models with small Domain-spEcific models.
arXiv Detail & Related papers (2024-03-27T08:57:21Z)
Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases [9.478012553728538]
We propose an end-to-end system design towards utilizing Retrieval Augmented Generation (RAG) to improve the factual accuracy of Large Language Models (LLMs) Our system integrates RAG pipeline with upstream datasets processing and downstream performance evaluation. Our experiments demonstrate the system's effectiveness in generating more accurate answers to domain-specific and time-sensitive inquiries.
arXiv Detail & Related papers (2024-03-15T16:30:14Z)
PANDA: Preference Adaptation for Enhancing Domain-Specific Abilities of LLMs [49.32067576992511]
Large language models often fall short of the performance achieved by domain-specific state-of-the-art models. One potential approach to enhance domain-specific capabilities of LLMs involves fine-tuning them using corresponding datasets. We propose Preference Adaptation for Enhancing Domain-specific Abilities of LLMs (PANDA) Our experimental results reveal that PANDA significantly enhances the domain-specific ability of LLMs on text classification and interactive decision tasks.
arXiv Detail & Related papers (2024-02-20T09:02:55Z)
Generalized Semantic Segmentation by Self-Supervised Source Domain Projection and Multi-Level Contrastive Learning [79.0660895390689]
Deep networks trained on the source domain show degraded performance when tested on unseen target domain data. We propose a Domain Projection and Contrastive Learning (DPCL) approach for generalized semantic segmentation.
arXiv Detail & Related papers (2023-03-03T13:07:14Z)
CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance. In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.