SpikeBERT: A Language Spikformer Learned from BERT with Knowledge
Distillation
- URL: http://arxiv.org/abs/2308.15122v4
- Date: Wed, 21 Feb 2024 13:20:21 GMT
- Title: SpikeBERT: A Language Spikformer Learned from BERT with Knowledge
Distillation
- Authors: Changze Lv, Tianlong Li, Jianhan Xu, Chenxi Gu, Zixuan Ling, Cenyuan
Zhang, Xiaoqing Zheng, Xuanjing Huang
- Abstract summary: Spiking neural networks (SNNs) offer a promising avenue to implement deep neural networks in a more energy-efficient way.
We improve a recently proposed spiking Transformer (i.e., Spikformer) so that it can process language tasks.
We show that the models trained with our method, named SpikeBERT, outperform state-of-the-art SNNs and even achieve comparable results to BERTs on text classification tasks for both English and Chinese.
- Score: 31.777019330200705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spiking neural networks (SNNs) offer a promising avenue to implement deep
neural networks in a more energy-efficient way. However, the network
architectures of existing SNNs for language tasks are still simplistic and
relatively shallow, and deep architectures have not been fully explored,
resulting in a significant performance gap compared to mainstream
transformer-based networks such as BERT. To this end, we improve a
recently proposed spiking Transformer (i.e., Spikformer) so that it can process
language tasks, and we propose a two-stage knowledge distillation method for
training it: first, pre-training by distilling knowledge from BERT on a large
collection of unlabelled texts; second, fine-tuning on task-specific instances
by distilling knowledge again from a BERT fine-tuned on the same training
examples. Through extensive experimentation, we show that the models
trained with our method, named SpikeBERT, outperform state-of-the-art SNNs and
even achieve comparable results to BERTs on text classification tasks for both
English and Chinese with much less energy consumption. Our code is available at
https://github.com/Lvchangze/SpikeBERT.
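To make the two-stage recipe above concrete, below is a minimal PyTorch sketch of the second (task-specific) stage: a spiking student distills from a BERT teacher that has already been fine-tuned on the same labelled examples. The function names, the temperature T, the weight alpha, and the HuggingFace-style teacher interface are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal, illustrative sketch of task-specific knowledge distillation
# (stage two of the two-stage recipe). Names and hyperparameters are
# assumptions for illustration, not SpikeBERT's exact implementation.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hard-label cross-entropy combined with a softened KL distillation term."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps the soft-target gradients on a comparable scale
    return alpha * ce + (1.0 - alpha) * kd

def training_step(student, teacher, batch, optimizer):
    """One step: the spiking student imitates a frozen, task-fine-tuned teacher."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits  # HuggingFace-style output
    student_logits = student(batch["input_ids"])              # spiking student's logits
    loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

One plausible reading of the first stage is the same soft-target alignment applied to unlabelled text without the cross-entropy term; the exact losses used for SpikeBERT's pre-training and fine-tuning are specified in the paper itself.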
Related papers
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems
to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by cross-modal encoders' success in visual-language tasks while we alter the learning objective to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named
Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from a rich-resource language to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Optimizing small BERTs trained for German NER [0.16058099298620418]
We investigate various training techniques of smaller BERT models and evaluate them on five public German NER tasks.
We propose two new fine-tuning techniques leading to better performance: CSE-tagging and a modified form of LCRF.
Furthermore, we introduce a new technique called WWA which reduces BERT memory usage and leads to a small increase in performance.
arXiv Detail & Related papers (2021-04-23T12:36:13Z)
- Using Prior Knowledge to Guide BERT's Attention in Semantic Textual
Matching Tasks [13.922700041632302]
We study the problem of incorporating prior knowledge into a deep Transformer-based model, i.e., Bidirectional Encoder Representations from Transformers (BERT).
We obtain better understanding of what task-specific knowledge BERT needs the most and where it is most needed.
Experiments demonstrate that the proposed knowledge-enhanced BERT is able to consistently improve semantic textual matching performance.
arXiv Detail & Related papers (2021-02-22T12:07:16Z)
- Evaluation of BERT and ALBERT Sentence Embedding Performance on
Downstream NLP Tasks [4.955649816620742]
This paper explores sentence embedding models for BERT and ALBERT.
We take a modified BERT network with Siamese and triplet network structures, called Sentence-BERT (SBERT), and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT).
arXiv Detail & Related papers (2021-01-26T09:14:06Z)
- E-BERT: A Phrase and Product Knowledge Enhanced Language Model for
E-commerce [63.333860695727424]
E-commerce tasks require accurate understanding of domain phrases, whereas such fine-grained phrase-level knowledge is not explicitly modeled by BERT's training objective.
To tackle the problem, we propose a unified pre-training framework, namely, E-BERT.
Specifically, to preserve phrase-level knowledge, we introduce Adaptive Hybrid Masking, which allows the model to adaptively switch from learning preliminary word knowledge to learning complex phrases.
To utilize product-level knowledge, we introduce Neighbor Product Reconstruction, which trains E-BERT to predict a product's associated neighbors with a denoising cross attention layer.
arXiv Detail & Related papers (2020-09-07T00:15:36Z)
- Neural Entity Linking on Technical Service Tickets [1.3621712165154805]
We show that a neural approach outperforms and complements hand-coded entities, with improvements of about 20% in top-1 accuracy.
We also show that a simple sentence-wise encoding (Bi-Encoder) offers a fast yet efficient search in practice.
arXiv Detail & Related papers (2020-05-15T15:47:02Z)
- lamBERT: Language and Action Learning Using Multimodal BERT [0.1942428068361014]
This study proposes a language and action learning model based on multimodal BERT (lamBERT).
Experiments are conducted in a grid environment that requires language understanding for the agent to act properly.
The lamBERT model obtained higher rewards in multitask settings and transfer settings when compared to other models.
arXiv Detail & Related papers (2020-04-15T13:54:55Z)
- DynaBERT: Dynamic BERT with Adaptive Width and Depth [55.18269622415814]
We propose a novel dynamic BERT model (abbreviated as DynaBERT).
It can flexibly adjust the size and latency by selecting adaptive width and depth.
It consistently outperforms existing BERT compression methods.
arXiv Detail & Related papers (2020-04-08T15:06:28Z)
- Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation [84.64004917951547]
Fine-tuning pre-trained language models like BERT has become an effective approach in NLP.
In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation (a minimal sketch of the self-distillation idea follows this list).
arXiv Detail & Related papers (2020-02-24T16:17:12Z)
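As referenced in the self-ensemble and self-distillation entry above, the following is a minimal sketch of the general self-distillation idea: a parameter-averaged copy of the model serves as its own teacher during fine-tuning. The EMA decay, the loss weight beta, the MSE regularizer, and the helper names are illustrative assumptions, not that paper's exact method.

```python
# Minimal, illustrative sketch of self-distillation during fine-tuning:
# a parameter-averaged copy of the model acts as its own teacher.
# Averaging rule, loss weight, and names are assumptions for illustration.
import copy
import torch
import torch.nn.functional as F

def make_teacher(student):
    """Start the teacher as a frozen copy of the student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

def update_teacher(teacher, student, decay=0.999):
    """Exponential moving average of the student's parameters."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

def self_distillation_step(student, teacher, batch, optimizer, beta=1.0):
    """Cross-entropy on labels plus a term pulling the student's logits
    toward the logits of its own parameter-averaged teacher."""
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"])
    student_logits = student(batch["input_ids"])
    loss = F.cross_entropy(student_logits, batch["labels"]) \
        + beta * F.mse_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_teacher(teacher, student)  # refresh the teacher after each step
    return loss.item()
```

Self-ensembling is realized here by the running parameter average; the actual averaging and loss choices in the paper above may differ.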