Open-Domain Dialogue Generation Based on Pre-trained Language Models
- URL: http://arxiv.org/abs/2010.12780v1
- Date: Sat, 24 Oct 2020 04:52:28 GMT
- Title: Open-Domain Dialogue Generation Based on Pre-trained Language Models
- Authors: Yan Zeng and Jian-Yun Nie
- Abstract summary: Pre-trained language models have been successfully used in response generation for open-domain dialogue.
Four main frameworks have been proposed: (1) Transformer-ED, using a Transformer encoder and decoder separately for source and target sentences; (2) Transformer-Dec, using a Transformer decoder for both source and target sentences; (3) Transformer-MLM, using a Transformer decoder that applies bi-directional attention on the source side and left-to-right attention on the target side with a masked language model objective; and (4) Transformer-AR, which uses an auto-regressive objective instead.
We compare these frameworks on 3 datasets, and our comparison reveals that the best framework uses bidirectional attention on the source side and does not separate encoder and decoder.
- Score: 23.828348485513043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models have been successfully used in response
generation for open-domain dialogue. Four main frameworks have been proposed:
(1) Transformer-ED using Transformer encoder and decoder separately for source
and target sentences; (2) Transformer-Dec using Transformer decoder for both
source and target sentences; (3) Transformer-MLM using Transformer decoder that
applies bi-directional attention on the source side and left-to-right attention
on the target side with masked language model objective; and (4) Transformer-AR
that uses auto-regressive objective instead. In this study, we compare these
frameworks on 3 datasets, and our comparison reveals that the best framework
uses bidirectional attention on the source side and does not separate encoder
and decoder. We also examine model discrepancy, and our experiments confirm
that the performance of a model is directly impacted by the underlying
discrepancies. We then propose two correction methods to reduce the
discrepancies, and both improve the model performance. These results show that
discrepancies are an important factor to consider when we use a pre-trained
model, and a reduction in discrepancies can lead to improved performance.
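The frameworks differ mainly in how self-attention is masked over the source and target. As a rough illustration only (a sketch, not code from the paper), the snippet below builds the attention masks for the single-stack variants, assuming the input is the concatenation [source; target]; Transformer-MLM and Transformer-AR share the same mask and differ only in their training objective, while Transformer-ED uses separate encoder and decoder stacks and is therefore omitted.

```python
# Illustrative sketch (not the authors' code): self-attention masks for the
# single-stack frameworks, given a concatenated [source; target] input.
import torch

def build_attention_mask(src_len: int, tgt_len: int, framework: str) -> torch.Tensor:
    """Boolean mask of shape (L, L); True means the key position is visible to the query."""
    L = src_len + tgt_len
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    if framework == "Transformer-Dec":
        # fully left-to-right over source and target alike
        return causal
    if framework in ("Transformer-MLM", "Transformer-AR"):
        # bidirectional attention within the source block; the target remains
        # left-to-right but still sees the whole source
        mask = causal.clone()
        mask[:src_len, :src_len] = True
        return mask
    raise ValueError("Transformer-ED uses separate encoder/decoder stacks")

# Example: 3 source tokens, 2 target tokens
print(build_attention_mask(3, 2, "Transformer-MLM"))
```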
Related papers
- Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity [11.302828987873497]
We present a Cross-Architecture Layerwise Distillation (CALD) approach that jointly converts a transformer model to a linear time substitute and fine-tunes it to a target task.
We show that CALD can effectively recover the result of the original model, and that the guiding strategy contributes to the result.
arXiv Detail & Related papers (2024-10-09T13:06:43Z)
- Making the Most of your Model: Methods for Finetuning and Applying Pretrained Transformers [0.21756081703276003]
This thesis provides methods and analysis of models which make progress on this goal.
We introduce two new finetuning methods which add new capabilities to the models they are used on.
We provide theoretical and empirical insights on the divergence of model-likelihood and output quality.
arXiv Detail & Related papers (2024-08-29T03:50:24Z)
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors [117.61449210940955]
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames.
arXiv Detail & Related papers (2023-06-21T06:18:05Z)
- Stateful Memory-Augmented Transformers for Efficient Dialogue Modeling [69.31802246621963]
We propose a novel memory-augmented transformer that is compatible with existing pre-trained encoder-decoder models.
By incorporating a separate memory module alongside the pre-trained transformer, the model can effectively interchange information between the memory states and the current input context.
arXiv Detail & Related papers (2022-09-15T22:37:22Z)
- On the Role of Bidirectionality in Language Model Pre-Training [85.14614350372004]
We study the role of bidirectionality in next token prediction, text infilling, zero-shot priming and fine-tuning.
We train models with up to 6.7B parameters, and find differences to remain consistent at scale.
arXiv Detail & Related papers (2022-05-24T02:25:05Z)
- Diformer: Directional Transformer for Neural Machine Translation [13.867255817435705]
Autoregressive (AR) and Non-autoregressive (NAR) models each have their own advantages in performance and latency.
We propose the Directional Transformer (Diformer) by jointly modelling AR and NAR into three generation directions.
Experiments on 4 WMT benchmarks demonstrate that Diformer outperforms current unified-modelling works by more than 1.5 BLEU points for both AR and NAR decoding.
arXiv Detail & Related papers (2021-12-22T02:35:29Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation [71.54816893482457]
We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
arXiv Detail & Related papers (2020-11-02T04:59:50Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
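As a generic illustration of the relative position encoding mentioned in the last entry (a minimal single-head sketch, not the Speech Transformer implementation from that paper), the attention score between query position i and key position j receives a learned bias that depends only on the clipped signed distance j - i rather than on absolute positions:

```python
# Minimal single-head sketch of relative positional encoding: a learned scalar
# bias, indexed by the clipped signed distance between query and key positions,
# is added to the content-based attention logits.
from typing import Optional
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, d_model: int, max_dist: int = 64):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.max_dist = max_dist
        # one learned bias per relative distance in [-max_dist, max_dist]
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_dist + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, L, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / D ** 0.5            # content term
        pos = torch.arange(L, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist)
        scores = scores + self.rel_bias[rel + self.max_dist]   # position term: depends on j - i only
        return torch.softmax(scores, dim=-1) @ v
```

Because the bias depends only on distance, such a layer can generalize to sequence lengths and timing variations not seen during training, which is the kind of robustness to variable distributions that the entry above attributes to relative encodings for speech.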