MCC-KD: Multi-CoT Consistent Knowledge Distillation
- URL: http://arxiv.org/abs/2310.14747v3
- Date: Wed, 20 Dec 2023 06:50:20 GMT
- Title: MCC-KD: Multi-CoT Consistent Knowledge Distillation
- Authors: Hongzhan Chen, Siyue Wu, Xiaojun Quan, Rui Wang, Ming Yan, Ji Zhang
- Abstract summary: We propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to efficiently distill the reasoning capabilities of LLMs into smaller models.
In MCC-KD, we generate multiple rationales for each question and enforce consistency among the corresponding predictions.
We investigate the effectiveness of MCC-KD with different model architectures and various model scales.
- Score: 39.327560600207626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have showcased remarkable capabilities in
complex reasoning through chain of thought (CoT) prompting. Recently, there has
been a growing interest in transferring these reasoning abilities from LLMs to
smaller models. However, achieving both diversity and consistency in
rationales presents a challenge. In this paper, we focus on enhancing these two
aspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to
efficiently distill the reasoning capabilities. In MCC-KD, we generate multiple
rationales for each question and enforce consistency among the corresponding
predictions by minimizing the bidirectional KL-divergence between the answer
distributions. We investigate the effectiveness of MCC-KD with different model
architectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both
mathematical reasoning and commonsense reasoning benchmarks. The empirical
results not only confirm MCC-KD's superior performance on in-distribution
datasets but also highlight its robust generalization ability on
out-of-distribution datasets.
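The consistency objective described in the abstract can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical rendering (not the authors' released code): it assumes the student model has already produced answer-token logits conditioned on two different rationales for the same question, and adds a bidirectional KL-divergence term between the resulting answer distributions to a standard cross-entropy objective. All function and tensor names are illustrative.

```python
import torch.nn.functional as F

def mcc_kd_loss(logits_r1, logits_r2, target_ids, alpha=1.0):
    """Minimal sketch of a multi-CoT consistency loss (illustrative only).

    logits_r1, logits_r2: student logits over answer tokens, each conditioned
        on a different rationale for the same question
        (shape: [batch, answer_len, vocab_size]).
    target_ids: gold answer token ids (shape: [batch, answer_len]).
    alpha: weight of the bidirectional KL consistency term.
    """
    # Token-level cross-entropy on both rationale branches.
    ce = (
        F.cross_entropy(logits_r1.transpose(1, 2), target_ids)
        + F.cross_entropy(logits_r2.transpose(1, 2), target_ids)
    ) / 2

    # Answer distributions under each rationale.
    log_p1 = F.log_softmax(logits_r1, dim=-1)
    log_p2 = F.log_softmax(logits_r2, dim=-1)

    # Bidirectional KL: KL(p1 || p2) + KL(p2 || p1).
    kl_1_to_2 = F.kl_div(log_p2, log_p1, reduction="batchmean", log_target=True)
    kl_2_to_1 = F.kl_div(log_p1, log_p2, reduction="batchmean", log_target=True)

    return ce + alpha * (kl_1_to_2 + kl_2_to_1)
```

In practice one would also mask padding and restrict the KL term to the answer span; the actual MCC-KD training setup (rationale sampling from the teacher, filtering, and scheduling) is described in the paper itself.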
Related papers
- Chain-of-Thought Prompting for Out-of-Distribution Samples: A Latent-Variable Study [5.236910203359897]
Chain-of-Thought (CoT) prompting has emerged as a powerful technique to improve in-context learning in large language models.
We extend a latent-variable framework for CoT prompting and study its behavior on two prototypical out-of-distribution (OOD) scenarios.
Our experiments demonstrate that CoT inference generalizes effectively to OOD samples whose latent variables closely resemble those seen during training, but its performance degrades as this similarity decreases.
arXiv Detail & Related papers (2025-04-17T14:59:29Z)
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [63.23935582919081]
Chain-of-Thought (CoT) prompting has significantly enhanced the reasoning capabilities of Large Language Models (LLMs).
We introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs.
We conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights.
arXiv Detail & Related papers (2025-02-13T18:59:46Z)
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs).
We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs.
We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
- Mentor-KD: Making Small Language Models Better Multi-step Reasoners [15.159415340059388]
We propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs.
We exploit a mentor, an intermediate-sized task-specific fine-tuned model, to generate additional CoT annotations.
We conduct extensive experiments and confirm Mentor-KD's effectiveness across various models and complex reasoning tasks.
arXiv Detail & Related papers (2024-10-11T17:53:27Z)
- Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment [10.104085497265004]
We propose Ranking Loss based Knowledge Distillation (RLKD), which encourages consistency of peak predictions between the teacher and student models.
Our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement in various downstream tasks.
arXiv Detail & Related papers (2024-09-19T08:06:42Z)
- Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities [16.69453837626083]
We propose a Correlation-decoupled Knowledge Distillation (CorrKD) framework for the Multimodal Sentiment Analysis (MSA) task under uncertain missing modalities.
We present a sample-level contrastive distillation mechanism that transfers comprehensive knowledge containing cross-sample correlations to reconstruct missing semantics.
We design a response-disentangled consistency distillation strategy to optimize the sentiment decision boundaries of the student network.
arXiv Detail & Related papers (2024-04-25T09:35:09Z)
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs).
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z)
- Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift [14.641747166801133]
Multimodal contrastive learning (MMCL) approaches, such as CLIP, have achieved remarkable success in learning representations that are robust against distribution shift.
We identify two mechanisms behind MMCL's robustness: intra-class contrasting and inter-class feature sharing.
We theoretically demonstrate the benefits of using rich captions on robustness and explore the effect of annotating different types of details in the captions.
arXiv Detail & Related papers (2023-10-08T02:25:52Z)
- Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z)
- Chain-of-Thought Prompt Distillation for Multimodal Named Entity Recognition and Multimodal Relation Extraction [8.169359626365619]
We generate a chain of thought (CoT), a sequence of intermediate reasoning steps.
We present a novel conditional prompt distillation method to assimilate the commonsense reasoning ability from large language models.
Our approach attains state-of-the-art accuracy and offers clear advantages in interpretability, data efficiency, and cross-domain generalization.
arXiv Detail & Related papers (2023-06-25T04:33:56Z)
- Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2023-02-02T07:51:19Z)
- Trusted Multi-View Classification with Dynamic Evidential Fusion [73.35990456162745]
We propose a novel multi-view classification algorithm, termed trusted multi-view classification (TMC).
TMC provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level.
Both theoretical and experimental results validate the effectiveness of the proposed model in accuracy, robustness and trustworthiness.
arXiv Detail & Related papers (2022-04-25T03:48:49Z)
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z)