Revisiting the Architectures like Pointer Networks to Efficiently
Improve the Next Word Distribution, Summarization Factuality, and Beyond
- URL: http://arxiv.org/abs/2305.12289v1
- Date: Sat, 20 May 2023 21:52:24 GMT
- Title: Revisiting the Architectures like Pointer Networks to Efficiently
Improve the Next Word Distribution, Summarization Factuality, and Beyond
- Authors: Haw-Shiuan Chang, Zonghai Yao, Alolika Gon, Hong Yu, Andrew McCallum
- Abstract summary: We propose several softmax alternatives by simplifying the pointer networks and accelerating the word-by-word rerankers.
In GPT-2, our proposals are significantly better and more efficient than mixture of softmax.
Our best method based on T5-Small improves the factCC score by 2 points on the CNN/DM and XSUM datasets, and improves MAUVE scores by 30% on the BookSum paragraph-level dataset.
- Score: 37.96043934146189
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Is the output softmax layer, which is adopted by most language models (LMs),
always the best way to compute the next word probability? Given so many
attention layers in a modern transformer-based LM, are the pointer networks
redundant nowadays? In this study, we discover that the answers to both
questions are no. This is because the softmax bottleneck sometimes prevents the
LMs from predicting the desired distribution and the pointer networks can be
used to break the bottleneck efficiently. Based on the finding, we propose
several softmax alternatives by simplifying the pointer networks and
accelerating the word-by-word rerankers. In GPT-2, our proposals are
significantly better and more efficient than mixture of softmax, a
state-of-the-art softmax alternative. In summarization experiments, without
significantly decreasing its training/testing speed, our best method based on
T5-Small improves the factCC score by 2 points on the CNN/DM and XSUM datasets, and
improves MAUVE scores by 30% on the BookSum paragraph-level dataset.
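The abstract's argument is that a single output softmax can be too low-rank to express the desired next-word distribution (the softmax bottleneck), and that a pointer-network-style head can break that bottleneck cheaply. The sketch below illustrates the generic pointer/copy idea in PyTorch: a copy distribution over context tokens is mixed with the usual softmax over the vocabulary. It is not the paper's proposed architecture; every tensor shape, name, and module (e.g. gate_proj) is a hypothetical placeholder for illustration only.

```python
# Minimal, illustrative sketch (NOT the paper's exact method) of mixing a
# pointer-style copy distribution with the standard output softmax, so the
# final next-word distribution is not limited by a single softmax layer.
import torch
import torch.nn.functional as F


def pointer_mixed_next_word(hidden, context_hiddens, context_token_ids,
                            output_embedding, gate_proj):
    # hidden:            (batch, d) decoder state at the prediction position
    # context_hiddens:   (batch, ctx_len, d) hidden states of the context tokens
    # context_token_ids: (batch, ctx_len) vocabulary ids of the context tokens
    # output_embedding:  (vocab, d) output embedding matrix
    # gate_proj:         nn.Linear(d, 1) producing the mixing gate (hypothetical)

    # Standard softmax head over the full vocabulary.
    p_vocab = F.softmax(hidden @ output_embedding.t(), dim=-1)   # (batch, vocab)

    # Pointer head: attend over context positions, then scatter the attention
    # weights back onto the vocabulary (probability of copying each token).
    copy_scores = torch.einsum("bd,bld->bl", hidden, context_hiddens)
    copy_attn = F.softmax(copy_scores, dim=-1)                   # (batch, ctx_len)
    p_copy = torch.zeros_like(p_vocab)
    p_copy.scatter_add_(1, context_token_ids, copy_attn)

    # A learned gate mixes the two heads; the mixture can put extra mass on
    # context words that a single softmax would struggle to rank correctly.
    gate = torch.sigmoid(gate_proj(hidden))                      # (batch, 1)
    return gate * p_vocab + (1.0 - gate) * p_copy


# Hypothetical usage with toy sizes (batch=2, ctx_len=7, d=16, vocab=100).
d, vocab = 16, 100
emb = torch.randn(vocab, d)
gate = torch.nn.Linear(d, 1)
h = torch.randn(2, d)
ctx_h = torch.randn(2, 7, d)
ctx_ids = torch.randint(0, vocab, (2, 7))
probs = pointer_mixed_next_word(h, ctx_h, ctx_ids, emb, gate)
print(probs.shape, probs.sum(dim=-1))  # (2, 100); each row sums to ~1
```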
Related papers
- Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach.
Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
- MultiMax: Sparse and Multi-Modal Attention Learning [60.49318008131978]
SoftMax is a ubiquitous ingredient of modern machine learning algorithms.
We show that sparsity can be achieved by a family of SoftMax variants, but they often require an alternative loss function and do not preserve multi-modality.
We propose MultiMax, which adaptively modulates the output distribution according to the input entry range.
arXiv Detail & Related papers (2024-06-03T10:51:43Z)
- r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of the softmax, outputting sparse probability distribution with controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z)
- A Study on ReLU and Softmax in Transformer [51.0740713922741]
The Transformer architecture consists of self-attention and feed-forward networks (FFNs) which can be viewed as key-value memories.
We first rebuild the connections between FFN and key-value memory by conducting extensive studies on ReLU and Softmax.
In addition, ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large.
arXiv Detail & Related papers (2023-02-13T15:41:20Z)
- To Softmax, or not to Softmax: that is the question when applying Active Learning for Transformer Models [24.43410365335306]
A well-known technique for reducing the amount of human effort in acquiring a labeled dataset is Active Learning (AL).
This paper compares eight alternatives on seven datasets.
Most of the methods are too good at identifying the truly most uncertain samples (outliers), and labeling these exclusively results in worse performance.
arXiv Detail & Related papers (2022-10-06T15:51:39Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- A multi-model-based deep learning framework for short text multiclass classification with the imbalanced and extremely small data set [0.6875312133832077]
This paper proposes a multi-model-based deep learning framework for short-text multiclass classification with an imbalanced and extremely small data set.
It retains the state-of-the-art baseline performance in terms of precision, recall, accuracy, and F1 score.
arXiv Detail & Related papers (2022-06-24T00:51:02Z)
- Breaking the Softmax Bottleneck for Sequential Recommender Systems with Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs.
We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems.
Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
arXiv Detail & Related papers (2021-10-11T16:52:23Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU (a minimal sketch of this substitution appears after this list).
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
- Effectiveness of MPC-friendly Softmax Replacement [13.710300609457267]
We analyze the two uses of the softmax replacement and compare them to softmax.
We find that the replacement provides a significant speed-up only for a one-layer network, while it always reduces accuracy, sometimes significantly.
arXiv Detail & Related papers (2020-11-23T04:14:32Z)
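The "Sparse Attention with Linear Units" entry above describes replacing the softmax activation inside attention with a ReLU (ReLA). The sketch below contrasts the two choices; it is an illustrative toy rather than the authors' full implementation, and the shapes and scaling choice are assumptions made only for this example.

```python
# Illustrative contrast between standard softmax attention and a ReLU-activated
# variant in the spirit of ReLA ("Sparse Attention with Linear Units" above).
# Toy sketch with assumed shapes, not the authors' full method.
import torch
import torch.nn.functional as F


def softmax_attention(q, k, v):
    # q, k, v: (batch, len, d). Scaled dot-product attention: every weight is
    # strictly positive, so the attention distribution is always dense.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v


def relu_attention(q, k, v):
    # Same scores, but ReLU instead of softmax: negative scores become exactly
    # zero, so many attention weights are zero (sparse attention).
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.relu(scores) @ v


# Tiny usage example with hypothetical sizes.
q, k, v = (torch.randn(2, 5, 16) for _ in range(3))
print(softmax_attention(q, k, v).shape, relu_attention(q, k, v).shape)
```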
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.