Related papers: $\textit{SKIntern}$: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models

$\textit{SKIntern}$: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models

URL: http://arxiv.org/abs/2409.13183v1
Date: Fri, 20 Sep 2024 03:23:20 GMT
Title: $\textit{SKIntern}$: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models
Authors: Huanxuan Liao, Shizhu He, Yupu Hao, Xiang Li, Yuanzhe Zhang, Kang Liu, Jun Zhao,
Abstract summary: Small Language Models (SLMs) are attracting attention due to the high computational demands and privacy concerns. We introduce $textitSKIntern$, an innovative approach that empowers SLMs to internalize symbolic knowledge.
Score: 27.07695214182334
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Small Language Models (SLMs) are attracting attention due to the high computational demands and privacy concerns of Large Language Models (LLMs). Some studies fine-tune SLMs using Chains of Thought (CoT) data distilled from LLMs, aiming to enhance their reasoning ability. Furthermore, Some CoT distillation methods introduce external symbolic knowledge into the generation process to improve the limited knowledge memory, reasoning ability and out-of-domain (OOD) generalization of SLMs. However, the introduction of symbolic knowledge increases computational overhead and introduces potential noise. In this paper, we introduce $\textit{SKIntern}$, an innovative approach that empowers SLMs to internalize symbolic knowledge and few-shot examples gradually through a progressive fine-tuning process, guided by a predefined linear decay schedule under curriculum learning. By efficiently internalizing knowledge, $\textit{SKIntern}$ reduces computational overhead and speeds up the reasoning process by focusing solely on the question during inference. It outperforms state-of-the-art baselines by over 5\%, while reducing inference costs (measured in FLOPs) by up to $4\times$ across a wide range of SLMs in both in-domain (ID) and out-of-domain (OOD) tasks. Our code will be available at \url{https://github.com/Xnhyacinth/SKIntern}.

Related papers

Cognitive Load-Aware Inference: A Neuro-Symbolic Framework for Optimizing the Token Economy of Large Language Models [0.9790236766474201]
This paper introduces the Cognitive Load-Aware Inference (CLAI) framework, which operationalizes principles from Cognitive Load Theory (CLT) and neuroscience for Large Language Model (LLM) inference.<n>We formalize the concepts of Intrinsic Cognitive Load, Extraneous Cognitive Load, and Germane Cognitive Load into quantifiable LLM metrics.<n>We propose two implementation paths: CLAI-Prompt, a zero-shot method that guides a base LLM through cognitive control steps via a structured meta-prompt, and CLAI-Tune, a fine-tuned model that internalizes these principles for spontaneous cognitive economy.
arXiv Detail & Related papers (2025-07-01T10:51:18Z)
Reinforced Latent Reasoning for LLM-based Recommendation [83.18146814163308]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks.<n>Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data.<n>In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
General Reasoning Requires Learning to Reason from the Get-go [19.90997698310839]
Large Language Models (LLMs) have demonstrated impressive real-world utility. But their ability to reason adaptively and robustly remains fragile. We propose disangling knowledge and reasoning through three key directions.
arXiv Detail & Related papers (2025-02-26T18:51:12Z)
END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. They are often distracted by irrelevant or noisy context in input sequences that degrades output quality. We introduce Early Noise Dropping (textscEND), a novel approach to mitigate this issue without requiring fine-tuning the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z)
LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning [79.48839334040197]
Instruction fine-tuning (IFT) can increase the informativeness of large language models (LLMs), but may reduce their truthfulness.<n>In this paper, we empirically demonstrate how unfamiliar knowledge in IFT datasets can negatively affect the truthfulness of LLMs.<n>We introduce two new IFT paradigms, $UNIT_cut$ and $UNIT_ref$, to address this issue.
arXiv Detail & Related papers (2025-02-17T16:10:30Z)
Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling [46.035795210898414]
We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS) Our work showcases the value of information theory in reinforcement learning and in the training of large language models.
arXiv Detail & Related papers (2025-02-08T03:47:00Z)
Rational Metareasoning for Large Language Models [17.479428400594028]
Being prompted to engage in reasoning has emerged as a core technique for using large language models (LLMs)<n>This work introduces a novel approach based on computational models of metareasoning used in cognitive science.<n>We develop a reward function that incorporates the Value of Computation by penalizing unnecessary reasoning.
arXiv Detail & Related papers (2024-10-07T23:48:52Z)
Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks [30.572064185770298]
We propose a novel knowledge distillation method for learning the complex reasoning abilities of Large Language Models (LLMs) NesyCD distills the general capabilities and specialized knowledge in LLMs using different manners. Our experiments show that NesyCD significantly boosts SLMs' complex reasoning performance on in-domain (BBH, GSM8K) and out-of-domain (AGIEval, ARC) datasets.
arXiv Detail & Related papers (2024-09-20T04:17:13Z)
Great Memory, Shallow Reasoning: Limits of $k$NN-LMs [71.73611113995143]
$k$NN-LMs, which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling. We ask whether this improved ability to recall information really translates into downstream abilities.
arXiv Detail & Related papers (2024-08-21T17:59:05Z)
Soft Prompting for Unlearning in Large Language Models [11.504012974208466]
This work focuses on investigating machine unlearning for Large Language Models motivated by data protection regulations. We propose a framework textbfSoft textbfPrompting for textbfUntextbflearning (SPUL) We conduct a rigorous evaluation of the proposed method and our results indicate that SPUL can significantly improve the trade-off between utility and forgetting.
arXiv Detail & Related papers (2024-06-17T19:11:40Z)
Refiner: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities [30.1331670544648]
Large Language Models (LLMs) are limited by their parametric knowledge, leading to hallucinations in knowledge-extensive tasks. We propose $textitRefiner$, an end-to-end extract-and-restructure paradigm that operates in the post-retrieval process of RAG.
arXiv Detail & Related papers (2024-06-17T09:25:10Z)
A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models. HiP reduces the time complexity of the attention mechanism to $O(T log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z)
LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts. LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA. Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
arXiv Detail & Related papers (2024-04-11T17:57:22Z)
FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks. We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners. We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting. Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers [29.319666323947708]
We present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness. Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context. Our reference implementation achieves up to $2times$ increase in inference throughput and even greater memory savings.
arXiv Detail & Related papers (2023-05-25T07:39:41Z)
OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs. Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
Learning to Ask Conversational Questions by Optimizing Levenshtein Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimize the minimum Levenshtein distance (MLD) through explicit editing actions. RISE is able to pay attention to tokens that are related to conversational characteristics. Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-06-30T08:44:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.