NewsBERT: Distilling Pre-trained Language Model for Intelligent News
Application
- URL: http://arxiv.org/abs/2102.04887v1
- Date: Tue, 9 Feb 2021 15:41:12 GMT
- Title: NewsBERT: Distilling Pre-trained Language Model for Intelligent News
Application
- Authors: Chuhan Wu, Fangzhao Wu, Yang Yu, Tao Qi, Yongfeng Huang, Qi Liu
- Abstract summary: We propose NewsBERT, which can distill pre-trained language models for efficient and effective news intelligence.
In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models.
In our experiments, NewsBERT can effectively improve model performance in various intelligent news applications with much smaller models.
- Score: 56.1830016521422
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models (PLMs) like BERT have made great progress in NLP.
News articles usually contain rich textual information, and PLMs have the
potential to enhance news text modeling for various intelligent news
applications like news recommendation and retrieval. However, most existing
PLMs are huge, with hundreds of millions of parameters. Many online news
applications need to serve millions of users with low latency tolerance, which
poses huge challenges to incorporating PLMs in these scenarios. Knowledge
distillation techniques can compress a large PLM into a much smaller one while
keeping good performance. However, existing language models are pre-trained
and distilled on general corpora like Wikipedia, which have some gaps with the
news domain and may be suboptimal for news intelligence. In this
paper, we propose NewsBERT, which can distill PLMs for efficient and effective
news intelligence. In our approach, we design a teacher-student joint learning
and distillation framework to collaboratively learn both teacher and student
models, where the student model can learn from the learning experience of the
teacher model. In addition, we propose a momentum distillation method that
incorporates the gradients of the teacher model into the update of the student model
to better transfer useful knowledge learned by the teacher model. Extensive
experiments on two real-world datasets with three tasks show that NewsBERT can
effectively improve model performance in various intelligent news
applications with much smaller models.
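
To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of one teacher-student joint learning step with distillation and a momentum-style blend of the teacher's gradients into the student's update. This is an illustration under assumptions, not the paper's implementation: the function name, the hypothetical `aligned_pairs` argument, the hyperparameters `alpha`, `tau`, and `momentum`, and the assumption of classification-style logit outputs are all illustrative; NewsBERT's actual losses, layer-level objectives, and gradient-mixing rule may differ.

```python
# Minimal sketch (assumptions noted above): teacher and student are trained
# jointly on the task, the student also matches the teacher's soft labels,
# and the teacher's gradients are mixed into the student's update.
import torch
import torch.nn.functional as F


def joint_distillation_step(teacher, student, inputs, labels,
                            teacher_opt, student_opt, aligned_pairs,
                            alpha=0.5, tau=2.0, momentum=0.1):
    """One joint step. `aligned_pairs` is a hypothetical list of
    (teacher_param, student_param) tensors with matching shapes."""
    teacher_logits = teacher(inputs)
    student_logits = student(inputs)

    # Task losses: teacher and student learn the news task simultaneously.
    teacher_task_loss = F.cross_entropy(teacher_logits, labels)
    student_task_loss = F.cross_entropy(student_logits, labels)

    # Distillation loss: student matches the teacher's softened predictions.
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * (tau * tau)

    teacher_opt.zero_grad()
    student_opt.zero_grad()
    teacher_task_loss.backward()
    (student_task_loss + alpha * distill_loss).backward()

    # Momentum distillation (assumption): mix the teacher's gradient into the
    # student's gradient on shape-aligned parameters before stepping.
    with torch.no_grad():
        for t_param, s_param in aligned_pairs:
            if t_param.grad is not None and s_param.grad is not None:
                s_param.grad.add_(momentum * t_param.grad)

    teacher_opt.step()
    student_opt.step()
```

Detaching the teacher's logits keeps the distillation gradient from flowing back into the teacher, so the teacher follows only its own task loss while the student additionally tracks the teacher's evolving predictions and gradients, i.e. its "learning experience".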
Related papers
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
By evaluating different benchmarks with proper strategies, even a small-scale 2.7B model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Talking Models: Distill Pre-trained Knowledge to Downstream Models via
Interactive Communication [25.653517213641575]
We develop an interactive communication process to help students of downstream tasks learn effectively from pre-trained foundation models.
Our design is inspired by the way humans learn from teachers who can explain knowledge in a way that meets the students' needs.
arXiv Detail & Related papers (2023-10-04T22:22:21Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z) - SN Computer Science: Towards Offensive Language Identification for Tamil
Code-Mixed YouTube Comments and Posts [2.0305676256390934]
This study presents extensive experiments using multiple deep learning and transfer learning models to detect offensive content on YouTube.
We propose a novel and flexible approach of selective translation and transliteration techniques to reap better results from fine-tuning and ensembling multilingual transformer networks.
The proposed models, ULMFiT and mBERTBiLSTM, yielded good results and are promising for effective offensive speech identification in low-resource languages.
arXiv Detail & Related papers (2021-08-24T20:23:30Z) - One Teacher is Enough? Pre-trained Language Model Distillation from
Multiple Teachers [54.146208195806636]
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs (see the sketch after this list).
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
arXiv Detail & Related papers (2021-06-02T08:42:33Z) - Training Microsoft News Recommenders with Pretrained Language Models in
the Loop [22.96193782709208]
We propose a novel framework, SpeedyFeed, which efficiently trains PLM-based news recommenders of superior quality.
SpeedyFeed is highlighted for its lightweight encoding pipeline, which removes most of the repetitive and redundant encoding operations.
The PLM-based model significantly outperforms state-of-the-art news recommenders in comprehensive offline experiments.
arXiv Detail & Related papers (2021-02-18T11:08:38Z) - InfoBERT: Improving Robustness of Language Models from An Information
Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z) - lamBERT: Language and Action Learning Using Multimodal BERT [0.1942428068361014]
This study proposes a language and action learning model using multimodal BERT (lamBERT).
Experiments are conducted in a grid environment that requires language understanding for the agent to act properly.
The lamBERT model obtained higher rewards in multitask settings and transfer settings when compared to other models.
arXiv Detail & Related papers (2020-04-15T13:54:55Z)
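
As a companion to the MT-BERT entry above, the following is a minimal sketch of a multi-teacher distillation loss, assuming the student matches the average of the teachers' softened predictions in addition to the task loss; MT-BERT's actual teacher weighting and losses may differ, and the function name and hyperparameters here are illustrative.

```python
# Minimal multi-teacher distillation loss sketch (assumption: uniform
# averaging of teacher soft labels; not MT-BERT's exact formulation).
import torch
import torch.nn.functional as F


def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          alpha=0.5, tau=2.0):
    """Task loss plus KL divergence to the mean of the teachers' soft labels."""
    task_loss = F.cross_entropy(student_logits, labels)

    # Average the teachers' softened probability distributions.
    soft_targets = torch.stack(
        [F.softmax(t.detach() / tau, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    distill_loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (tau * tau)

    return task_loss + alpha * distill_loss
```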