I2D2: Inductive Knowledge Distillation with NeuroLogic and
Self-Imitation
- URL: http://arxiv.org/abs/2212.09246v3
- Date: Fri, 26 May 2023 17:14:27 GMT
- Title: I2D2: Inductive Knowledge Distillation with NeuroLogic and
Self-Imitation
- Authors: Chandra Bhagavatula, Jena D. Hwang, Doug Downey, Ronan Le Bras, Ximing
Lu, Lianhui Qin, Keisuke Sakaguchi, Swabha Swayamdipta, Peter West, Yejin
Choi
- Abstract summary: We study generative models of commonsense knowledge, focusing on the task of generating generics.
We introduce I2D2, a novel commonsense distillation framework that loosely follows the Symbolic Knowledge Distillation of West et al.
Our study leads to a new corpus of generics, Gen-A-tomic, that is the largest and highest quality available to date.
- Score: 89.38161262164586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Commonsense capabilities of pre-trained language models dramatically improve
with scale, leading many to believe that scale is the only winning recipe. But
is it? Here, we investigate an alternative that a priori seems impossible: can
smaller language models (e.g., GPT-2) win over models that are orders of
magnitude larger and better (e.g., GPT-3), if powered with novel commonsense
distillation algorithms? The key intellectual challenge is to design a learning
algorithm that achieves a competitive level of commonsense acquisition, without
relying on the benefits of scale. In particular, we study generative models of
commonsense knowledge, focusing on the task of generating generics, statements
of commonsense facts about everyday concepts, e.g., birds can fly.
We introduce I2D2, a novel commonsense distillation framework that loosely
follows the Symbolic Knowledge Distillation of West et al. but breaks the
dependence on the extreme-scale teacher model with two innovations: (1) the
novel adaptation of NeuroLogic Decoding to enhance the generation quality of
the weak, off-the-shelf language models, and (2) self-imitation learning to
iteratively learn from the model's own enhanced commonsense acquisition
capabilities. Empirical results suggest that scale is not the only way, as
novel algorithms can be a promising alternative. Moreover, our study leads to a
new corpus of generics, Gen-A-tomic, that is the largest and highest quality
available to date.
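For concreteness, the following is a minimal, hedged sketch of one round of the I2D2-style loop described above: constrained generation from a small off-the-shelf model, critic filtering, and self-imitation fine-tuning on the model's own accepted generations. It is an illustration under stated assumptions, not the paper's implementation: HuggingFace's `force_words_ids` constrained beam search stands in for NeuroLogic Decoding, a length-normalized log-likelihood threshold stands in for the paper's supervised critic, and the concept list, relation word, and hyperparameters are placeholders.

```python
# Hedged sketch of one I2D2-style round: constrained generation from a small LM,
# critic filtering, and self-imitation fine-tuning on the accepted outputs.
# Assumptions (not from the paper): `force_words_ids` constrained beam search is a
# stand-in for NeuroLogic Decoding; the log-likelihood threshold is a stand-in for
# the trained critic; concepts, relation word, and hyperparameters are placeholders.
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def generate_generics(concept, relation=" can", n=8):
    """Constrained beam search: every candidate must contain the relation word."""
    prompt = tok(concept, return_tensors="pt")
    constraint = [tok(relation, add_special_tokens=False).input_ids]
    out = model.generate(
        **prompt,
        num_beams=max(n, 8),
        num_return_sequences=n,
        force_words_ids=constraint,      # lexical constraint, a proxy for NeuroLogic
        max_new_tokens=12,
        no_repeat_ngram_size=2,
        pad_token_id=tok.eos_token_id,
    )
    return [tok.decode(ids, skip_special_tokens=True) for ids in out]

@torch.no_grad()
def critic_score(text):
    """Placeholder critic: average token log-likelihood under the current model
    (the paper trains a supervised critic instead)."""
    batch = tok(text, return_tensors="pt")
    return -model(**batch, labels=batch["input_ids"]).loss.item()

def self_imitation_round(concepts, threshold=-4.0):
    """Generate, filter with the critic, then fine-tune on the surviving generics."""
    accepted = [
        g for c in concepts for g in generate_generics(c) if critic_score(g) > threshold
    ]
    opt = AdamW(model.parameters(), lr=1e-5)
    model.train()
    for text in accepted:                # imitate the model's own filtered outputs
        batch = tok(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()
    return accepted

corpus = self_imitation_round(["Birds", "A kettle", "Most bicycles"])
print(corpus[:5])
```

In the framework described in the abstract, this kind of round is iterated so the model learns from its own enhanced outputs, with the filtered generics accumulated into a corpus such as Gen-A-tomic.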
Related papers
- BSDP: Brain-inspired Streaming Dual-level Perturbations for Online Open
World Object Detection [31.467501311528498]
We aim to make deep learning models simulate the way people learn.
Existing OWOD approaches pay more attention to identifying unknown categories, even though the incremental learning part is just as important.
In this paper, we take the dual-level information of old samples as perturbations on new samples to make the model good at learning new knowledge without forgetting the old knowledge.
arXiv Detail & Related papers (2024-03-05T04:00:50Z) - Class incremental learning with probability dampening and cascaded gated classifier [4.285597067389559]
We propose a novel incremental regularisation approach called Margin Dampening and Cascaded Scaling.
The first combines a soft constraint and a knowledge distillation approach to preserve past knowledge while still allowing the model to learn new patterns.
We empirically show that our approach performs well on multiple benchmarks against well-established baselines.
arXiv Detail & Related papers (2024-02-02T09:33:07Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now in widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To mark the difference in parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z) - Anti-Retroactive Interference for Lifelong Learning [65.50683752919089]
We design a paradigm for lifelong learning based on meta-learning and the associative mechanism of the brain.
It tackles the problem from two aspects: extracting knowledge and memorizing knowledge.
Theoretical analysis shows that the proposed learning paradigm can make the models of different tasks converge to the same optimum.
arXiv Detail & Related papers (2022-08-27T09:27:36Z) - Twist Decoding: Diverse Generators Guide Each Other [116.20780037268801]
We introduce Twist decoding, a simple and general inference algorithm that generates text while benefiting from diverse models.
Our method does not assume the vocabulary, tokenization or even generation order is shared.
arXiv Detail & Related papers (2022-05-19T01:27:53Z) - Generated Knowledge Prompting for Commonsense Reasoning [53.88983683513114]
We propose generating knowledge statements directly from a language model with a generic prompt format.
This approach improves performance of both off-the-shelf and finetuned language models on four commonsense reasoning tasks.
Notably, we find that a model's predictions can improve when using its own generated knowledge; a minimal sketch of this prompting pipeline is given after the related-papers list.
arXiv Detail & Related papers (2021-10-15T21:58:03Z) - Symbolic Knowledge Distillation: from General Language Models to
Commonsense Models [38.29726383331247]
In this approach, general language models author commonsense knowledge graphs that are used to train commonsense models.
We distill knowledge symbolically, as text, in addition to the neural model.
For the first time, a human-authored commonsense knowledge graph is surpassed by our automatically distilled variant.
arXiv Detail & Related papers (2021-10-14T06:50:19Z) - DISCOS: Bridging the Gap between Discourse Knowledge and Commonsense
Knowledge [42.08569149041291]
We propose an alternative commonsense knowledge acquisition framework DISCOS.
DISCOS populates expensive commonsense knowledge onto more affordable linguistic knowledge resources.
We can acquire 3.4M ATOMIC-like inferential commonsense knowledge by populating ATOMIC on the core part of ASER.
arXiv Detail & Related papers (2021-01-01T03:30:38Z)
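Returning to the Generated Knowledge Prompting entry above, the following is a minimal, hedged sketch of that two-step pipeline: generate knowledge statements from a few-shot prompt, then answer a commonsense question conditioned on each statement. This is an illustration under stated assumptions, not the paper's exact setup: GPT-2 plays both the knowledge generator and the answering model, the few-shot prompt is a single made-up demonstration, and a crude length-normalized log-likelihood over the whole text is used to rank answer choices.

```python
# Hedged sketch of generated-knowledge prompting: sample knowledge statements from a
# few-shot prompt, then rank answer choices conditioned on each statement.
# Assumptions: GPT-2 serves as both generator and answerer; the prompt and the
# whole-sequence likelihood scoring below are illustrative, not the paper's setup.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

FEW_SHOT = (
    "Generate some knowledge about the input.\n"
    "Input: greenhouses are used for growing what?\n"
    "Knowledge: Greenhouses trap heat so plants can grow in cold weather.\n"
    "Input: {question}\nKnowledge:"
)

def generate_knowledge(question, k=4):
    """Sample k knowledge statements continuing the few-shot prompt."""
    prompt = tok(FEW_SHOT.format(question=question), return_tensors="pt")
    out = model.generate(
        **prompt, do_sample=True, top_p=0.9, num_return_sequences=k,
        max_new_tokens=20, pad_token_id=tok.eos_token_id,
    )
    new_tokens = out[:, prompt["input_ids"].shape[1]:]
    return [tok.decode(t, skip_special_tokens=True).split("\n")[0] for t in new_tokens]

@torch.no_grad()
def answer_score(context, answer):
    """Average token log-likelihood of the full text (context plus answer);
    a crude ranking proxy rather than answer-only scoring."""
    batch = tok(context + " " + answer, return_tensors="pt")
    return -model(**batch, labels=batch["input_ids"]).loss.item()

def answer_with_knowledge(question, choices):
    """Pick the choice with the highest score over all generated knowledge statements."""
    statements = generate_knowledge(question)
    best = max(
        ((c, answer_score(f"{s} {question}", c)) for s in statements for c in choices),
        key=lambda pair: pair[1],
    )
    return best[0]

print(answer_with_knowledge("Where do penguins live?", ["the desert", "Antarctica"]))
```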