To Copy, or not to Copy; That is a Critical Issue of the Output Softmax
Layer in Neural Sequential Recommenders
- URL: http://arxiv.org/abs/2310.14079v1
- Date: Sat, 21 Oct 2023 18:04:04 GMT
- Title: To Copy, or not to Copy; That is a Critical Issue of the Output Softmax
Layer in Neural Sequential Recommenders
- Authors: Haw-Shiuan Chang, Nikhil Agarwal, Andrew McCallum
- Abstract summary: In this study, we identify a major source of the problem: the single hidden state embedding and static item embeddings in the output softmax layer.
We adapt recently proposed softmax alternatives such as softmax-CPR to sequential recommendation tasks and demonstrate that the new softmax architectures unleash the neural encoder's capability to learn when to copy and when to exclude items from the input sequence.
- Score: 48.8643117818312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies suggest that the existing neural models have difficulty
handling repeated items in sequential recommendation tasks. However, our
understanding of this difficulty is still limited. In this study, we
substantially advance this field by identifying a major source of the problem:
the single hidden state embedding and static item embeddings in the output
softmax layer. Specifically, the similarity structure of the global item
embeddings in the softmax layer sometimes forces the single hidden-state
embedding to be close to new items when copying would be the better choice,
and at other times inappropriately forces the hidden state to be close to
items from the input. To alleviate the problem, we adapt recently proposed
softmax alternatives such as softmax-CPR to sequential recommendation tasks
and demonstrate that the new softmax architectures unleash the neural
encoder's capability to learn when to copy and when to exclude items from the
input sequence. By making only simple modifications to the output softmax
layer of SASRec and GRU4Rec, softmax-CPR achieves consistent improvements
across 12 datasets. With almost the same model size, our best method improves
the average NDCG@10 of GRU4Rec by 10% (4%-17% individually) on 5 datasets with
duplicated items and by 24% (8%-39%) on 7 datasets without duplicated items!
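As a rough illustration of the idea in the abstract, the sketch below shows a generic pointer-style copy mechanism added to the output layer of a sequential recommender. It is not the actual softmax-CPR formulation (the function and the toy gate are hypothetical simplifications); it only shows how a separate copy branch plus a gate can free the single hidden state from having to encode the copy-versus-new-item decision purely through its dot products with the static item embeddings.

```python
# Hypothetical sketch (PyTorch): a pointer-style copy mechanism on top of the
# output layer of a sequential recommender. This is NOT the paper's exact
# softmax-CPR design, only a generic illustration of learning when to copy
# items from the input sequence.
import torch
import torch.nn.functional as F

def copy_augmented_log_probs(hidden, item_emb, input_ids, seq_states):
    """
    hidden:     (B, d)    final hidden state of the encoder (e.g. GRU4Rec/SASRec)
    item_emb:   (V, d)    static output item embeddings
    input_ids:  (B, L)    item ids in the input sequence
    seq_states: (B, L, d) per-position encoder states
    Returns log-probabilities over the full item vocabulary, shape (B, V).
    """
    # Standard softmax branch: one hidden state scored against all static item embeddings.
    gen_logits = hidden @ item_emb.t()                             # (B, V)

    # Copy branch: attention of the hidden state over the input positions.
    copy_scores = torch.einsum("bd,bld->bl", hidden, seq_states)   # (B, L)

    # Toy gate deciding how much mass goes to copying vs. generating.
    # In practice this would be a learned linear layer on the hidden state.
    gate = torch.sigmoid(hidden.sum(-1, keepdim=True))             # (B, 1)

    gen_probs = F.softmax(gen_logits, dim=-1) * (1 - gate)
    copy_probs = F.softmax(copy_scores, dim=-1) * gate

    # Scatter the copy probabilities back onto the vocabulary.
    probs = gen_probs.scatter_add(1, input_ids, copy_probs)
    return torch.log(probs + 1e-9)
```

In a standard single-softmax head, the same dot-product geometry must rank both in-sequence and out-of-sequence items, which is exactly the conflict the abstract describes; the separate copy branch sidesteps that constraint.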
Related papers
- Revisiting the Architectures like Pointer Networks to Efficiently
Improve the Next Word Distribution, Summarization Factuality, and Beyond [37.96043934146189]
We propose several softmax alternatives by simplifying the pointer networks and accelerating the word-by-word rerankers.
In GPT-2, our proposals are significantly better and more efficient than the mixture of softmax.
Our best method based on T5-Small improves the FactCC score by 2 points on the CNN/DM and XSUM datasets, and improves MAUVE scores by 30% on the BookSum paragraph-level dataset.
arXiv Detail & Related papers (2023-05-20T21:52:24Z) - The In-Sample Softmax for Offline Reinforcement Learning [37.37457955279337]
Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy.
The standard max operator may select a maximal action that has not been seen in the dataset; bootstrapping from these inaccurate values can lead to overestimation and even divergence.
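To make that failure mode concrete, here is a minimal, assumption-based sketch of restricting the Bellman backup to actions actually observed in the dataset. It illustrates the in-sample idea rather than reproducing the paper's exact in-sample softmax.

```python
# Hypothetical sketch: restrict the max/softmax backup to in-sample actions.
# Names and shapes are assumptions, not the paper's API.
import torch

def in_sample_max(q_values, seen_mask):
    """
    q_values:  (B, A) estimated Q-values for every action
    seen_mask: (B, A) boolean, True where the action appears in the dataset
               for that state
    Returns the max Q-value taken only over in-sample actions.
    """
    masked_q = q_values.masked_fill(~seen_mask, float("-inf"))
    return masked_q.max(dim=-1).values

def in_sample_softmax_backup(q_values, seen_mask, temperature=1.0):
    """A soft variant: softmax-weighted value over in-sample actions only."""
    masked_q = q_values.masked_fill(~seen_mask, float("-inf"))
    weights = torch.softmax(masked_q / temperature, dim=-1)       # zero on unseen actions
    return (weights * q_values.masked_fill(~seen_mask, 0.0)).sum(dim=-1)
```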
arXiv Detail & Related papers (2023-02-28T07:55:02Z) - To Softmax, or not to Softmax: that is the question when applying Active
Learning for Transformer Models [24.43410365335306]
A well-known technique to reduce the amount of human effort in acquiring a labeled dataset is Active Learning (AL).
This paper compares eight alternatives on seven datasets.
Most of the methods are too good at identifying the truly most uncertain samples (outliers), and labeling those exclusively results in worse performance.
arXiv Detail & Related papers (2022-10-06T15:51:39Z) - Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation [2.3813678058429626]
The softmax function is widely used in artificial neural networks for the multiclass classification problems.
In this paper, we provide an empirical study on a simple and concise softmax variant, namely sparse-softmax, to alleviate the problem that occurred in traditional softmax in terms of high-dimensional classification problems.
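One common way to realize such a sparse variant is to keep only the top-k logits and renormalize. The snippet below is an illustrative sketch of that idea, not necessarily the paper's exact sparse-softmax definition.

```python
# Hypothetical top-k "sparse softmax": keep the k largest logits, zero out the
# rest, and renormalize. The paper's exact variant may differ.
import torch

def sparse_softmax(logits, k=5):
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    probs = torch.zeros_like(logits)
    probs.scatter_(-1, topk_idx, torch.softmax(topk_vals, dim=-1))
    return probs

logits = torch.randn(2, 10_000)   # e.g. scores over a large item/label space
p = sparse_softmax(logits, k=5)   # only 5 entries per row are non-zero
```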
arXiv Detail & Related papers (2021-12-23T09:53:38Z) - Breaking the Softmax Bottleneck for Sequential Recommender Systems with
Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs.
We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems.
Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
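The summary names two ingredients, dropout and decoupling. The sketch below is one loose reading of them (dropout on the representation feeding the softmax, plus untied input and output item embeddings); it is an assumption-based illustration, not the paper's exact D&D design.

```python
# Assumption-based illustration of "dropout + decoupling" for a softmax head,
# not the paper's exact method.
import torch
import torch.nn as nn

class DecoupledSoftmaxHead(nn.Module):
    def __init__(self, num_items, dim, p_drop=0.5):
        super().__init__()
        self.input_emb = nn.Embedding(num_items, dim)    # used by the encoder
        self.output_emb = nn.Embedding(num_items, dim)   # separate ("decoupled") output weights
        self.dropout = nn.Dropout(p_drop)

    def forward(self, hidden):                           # hidden: (B, dim)
        hidden = self.dropout(hidden)                    # regularize the softmax input
        return hidden @ self.output_emb.weight.t()       # logits: (B, num_items)
```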
arXiv Detail & Related papers (2021-10-11T16:52:23Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
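The core substitution described above is easy to state in code: replace the softmax over attention scores with a ReLU, so many attention weights become exactly zero. The published ReLA model adds further normalization and gating that this minimal sketch omits.

```python
# Minimal sketch of ReLU-based attention: softmax over the scores is replaced
# by ReLU, yielding sparse, unnormalized attention weights.
import torch
import torch.nn.functional as F

def rectified_linear_attention(q, k, v):
    """
    q: (B, Tq, d), k: (B, Tk, d), v: (B, Tk, dv)
    """
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # (B, Tq, Tk)
    weights = F.relu(scores)                                  # many weights are exactly zero
    return weights @ v                                        # (B, Tq, dv)
```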
arXiv Detail & Related papers (2021-04-14T17:52:38Z) - Modeling Token-level Uncertainty to Learn Unknown Concepts in SLU via
Calibrated Dirichlet Prior RNN [98.4713940310056]
One major task of spoken language understanding (SLU) in modern personal assistants is to extract semantic concepts from an utterance.
Recent research has collected question-and-answer annotated data to learn what is unknown and should be asked.
We incorporate softmax-based slot filling neural architectures to model the sequence uncertainty without question supervision.
arXiv Detail & Related papers (2020-10-16T02:12:30Z) - Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient.
Using 30-fold fewer parameters and 16-fold smaller minibatches, we outperform the reported performance of the class-dependent BigGAN on CIFAR-10, obtaining an FID of 12.19 without using the class labels.
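A rough sketch of the Lookahead idea in a two-player setting is below: run k ordinary generator/discriminator updates, then pull the "slow" weights part-way toward the resulting "fast" weights for both players. The function and step schedule are assumptions; the actual Lookahead-minmax algorithm may differ in its details.

```python
# Hypothetical sketch of Lookahead applied to minmax (GAN) training.
import torch

def lookahead_minmax_step(G, D, g_opt, d_opt, gan_inner_step, k=5, alpha=0.5):
    # Snapshot the "slow" weights of both players.
    g_slow = {n: p.detach().clone() for n, p in G.named_parameters()}
    d_slow = {n: p.detach().clone() for n, p in D.named_parameters()}
    # Take k ordinary "fast" updates (user-supplied inner step).
    for _ in range(k):
        gan_inner_step(G, D, g_opt, d_opt)
    # Interpolate: slow <- slow + alpha * (fast - slow), for both players.
    with torch.no_grad():
        for n, p in G.named_parameters():
            p.copy_(g_slow[n] + alpha * (p - g_slow[n]))
        for n, p in D.named_parameters():
            p.copy_(d_slow[n] + alpha * (p - d_slow[n]))
```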
arXiv Detail & Related papers (2020-06-25T17:13:23Z) - Least Squares Regression with Markovian Data: Fundamental Limits and
Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
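The experience-replay idea can be illustrated with a small sketch: buffer the incoming Markov-chain samples and run SGD on uniformly re-sampled points, which breaks the temporal correlation of consecutive samples. This is an illustrative sketch, not the paper's exact algorithm or step-size schedule.

```python
# Illustrative sketch of SGD with experience replay for least squares on
# Markovian data; not the paper's exact algorithm.
import random
import numpy as np

def sgd_with_replay(stream, dim, buffer_size=1000, lr=0.01, replay_batch=8):
    w = np.zeros(dim)
    buffer = []
    for x, y in stream:                      # stream yields correlated (x, y) pairs
        buffer.append((x, y))
        if len(buffer) > buffer_size:
            buffer.pop(0)
        # Update on uniformly re-sampled past points to decorrelate the steps.
        for xb, yb in random.sample(buffer, min(replay_batch, len(buffer))):
            grad = (w @ xb - yb) * xb        # gradient of 0.5 * (w·x - y)^2
            w -= lr * grad
    return w
```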
arXiv Detail & Related papers (2020-06-16T04:26:50Z) - A Generic Network Compression Framework for Sequential Recommender
Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing user's dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve up to 4-8 times compression rates on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)