Exploring the Space of Key-Value-Query Models with Intention
- URL: http://arxiv.org/abs/2305.10203v1
- Date: Wed, 17 May 2023 13:25:57 GMT
- Title: Exploring the Space of Key-Value-Query Models with Intention
- Authors: Marta Garnelo, Wojciech Marian Czarnecki
- Abstract summary: Two key components of Attention are the structure of its input (which consists of keys, values and queries) and the computations by which these three are combined.
We refer to this space as Keys-Values-Queries (KVQ) Space.
Our goal is to determine whether there are any other stackable models in KVQ Space that Attention cannot efficiently approximate.
- Score: 8.585795909956726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based models have been a key element of many recent breakthroughs
in deep learning. Two key components of Attention are the structure of its
input (which consists of keys, values and queries) and the computations by
which these three are combined. In this paper we explore the space of models
that share said input structure but are not restricted to the computations of
Attention. We refer to this space as Keys-Values-Queries (KVQ) Space. Our goal
is to determine whether there are any other stackable models in KVQ Space that
Attention cannot efficiently approximate, which we can implement with our
current deep learning toolbox and that solve problems that are interesting to
the community. Maybe surprisingly, the solution to the standard least squares
problem satisfies these properties. A neural network module that is able to
compute this solution not only enriches the set of computations that a neural
network can represent but is also provably a strict generalisation of Linear
Attention. Even more surprisingly the computational complexity of this module
is exactly the same as that of Attention, making it a suitable drop in
replacement. With this novel connection between classical machine learning
(least squares) and modern deep learning (Attention) established we justify a
variation of our model which generalises regular Attention in the same way.
Both new modules are put to the test on a wide spectrum of tasks, ranging from
few-shot learning to policy distillation, which confirms their real-world
applicability.
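To make the claimed connection concrete, below is a minimal NumPy sketch contrasting (unnormalised) Linear Attention with a least-squares KVQ module of the kind the abstract describes; the function names, the ridge term, and the readout via the solved weights are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Unnormalised linear attention: queries read values through Q K^T V."""
    return Q @ K.T @ V                          # (m, d) (d, n) (n, d_v) -> (m, d_v)

def least_squares_kvq(Q, K, V, ridge=1e-3):
    """Illustrative least-squares KVQ module (not the paper's exact layer).

    Solve W* = argmin_W ||K W - V||^2 (with a small ridge term for numerical
    stability), then read the solution out with the queries. If K^T K were
    the identity, W* would equal K^T V and the output would coincide with
    linear attention above, which is the sense in which such a module can
    generalise it. Both routines do work that is linear in the number of
    key/value pairs n.
    """
    d = K.shape[1]
    W = np.linalg.solve(K.T @ K + ridge * np.eye(d), K.T @ V)  # (d, d_v)
    return Q @ W                                               # (m, d_v)

# Toy shapes: n key/value pairs, m queries, head dimension d.
rng = np.random.default_rng(0)
n, m, d, d_v = 32, 8, 16, 16
K, V, Q = rng.normal(size=(n, d)), rng.normal(size=(n, d_v)), rng.normal(size=(m, d))

print(linear_attention(Q, K, V).shape)   # (8, 16)
print(least_squares_kvq(Q, K, V).shape)  # (8, 16)
```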
Related papers
- Convolutional Rectangular Attention Module [3.3975558777609915]
We introduce a novel spatial attention module that can be integrated into any convolutional network.
This module guides the model to pay attention to the most discriminative part of an image.
arXiv Detail & Related papers (2025-03-13T20:41:36Z) - Neural Metamorphosis [72.88137795439407]
This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta), which aims to build self-morphable neural networks.
NeuMeta directly learns the continuous weight manifold of neural networks.
It sustains full-size performance even at a 75% compression rate.
arXiv Detail & Related papers (2024-10-10T14:49:58Z) - Class incremental learning with probability dampening and cascaded gated classifier [4.285597067389559]
We propose a novel incremental regularisation approach called Margin Dampening and Cascaded Scaling.
The first combines a soft constraint and a knowledge distillation approach to preserve past knowledge while still allowing the model to learn new patterns.
We empirically show that our approach performs well on multiple benchmarks against well-established baselines.
arXiv Detail & Related papers (2024-02-02T09:33:07Z) - Learning Structure-from-Motion with Graph Attention Networks [23.87562683118926]
We tackle the problem of learning Structure-from-Motion (SfM) through the use of graph attention networks.
In this work we learn a model that takes as input the 2D keypoints detected across multiple views, and outputs the corresponding camera poses and 3D keypoint coordinates.
Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences.
arXiv Detail & Related papers (2023-08-30T12:13:13Z) - Modular Neural Network Approaches for Surgical Image Recognition [0.0]
We introduce and evaluate different architectures of modular learning for Dorsal Capsulo-Scapholunate Septum (DCSS) instability classification.
Our experiments have shown that modular learning improves performances compared to non-modular systems.
In the second part, we present our approach for data labeling and segmentation with self-training applied on shoulder arthroscopy images.
arXiv Detail & Related papers (2023-07-17T22:28:16Z) - Neural Attentive Circuits [93.95502541529115]
We introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs).
NACs learn the parameterization and a sparse connectivity of neural modules without using domain knowledge.
NACs achieve an 8x speedup at inference time while losing less than 3% performance.
arXiv Detail & Related papers (2022-10-14T18:00:07Z) - Part-Based Models Improve Adversarial Robustness [57.699029966800644]
We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks.
Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts.
Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations.
arXiv Detail & Related papers (2022-09-15T15:41:47Z) - Discrete Key-Value Bottleneck [95.61236311369821]
Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
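The discrete bottleneck idea lends itself to a short sketch. Below is a hedged, illustrative NumPy version in which encoder features are snapped to the nearest key of a codebook and the paired, separately learnable value is forwarded; the name `discrete_kv_bottleneck` and the plain Euclidean nearest-key rule are assumptions, not the paper's exact design.

```python
import numpy as np

def discrete_kv_bottleneck(z, keys, values):
    """Hedged sketch of a discrete key-value bottleneck.

    z      : encoder features,             shape (batch, d_k)
    keys   : codebook of keys,             shape (num_codes, d_k)
    values : separately learnable values,  shape (num_codes, d_v)

    Each input is snapped to its nearest key and the paired value is
    forwarded to the downstream head, so task-specific learning can be
    confined to the sparse set of values that actually get selected.
    """
    # Squared Euclidean distance from every input to every key.
    d2 = ((z[:, None, :] - keys[None, :, :]) ** 2).sum(-1)  # (batch, num_codes)
    idx = d2.argmin(axis=1)                                 # nearest-key index
    return values[idx], idx                                 # (batch, d_v), (batch,)

rng = np.random.default_rng(1)
z = rng.normal(size=(4, 8))
keys = rng.normal(size=(64, 8))
values = rng.normal(size=(64, 16))
out, codes = discrete_kv_bottleneck(z, keys, values)
print(out.shape, codes)  # (4, 16) and the selected code indices
```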
arXiv Detail & Related papers (2022-07-22T17:52:30Z) - Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reach state-of-the-art accuracies on the parameter-limited setting of ImageNet classification benchmark.
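As an illustration of what dropping the query-key interaction can look like, here is a hedged NumPy sketch of key-only attention with a scalar saliency gate; the gate parameter `w_gate` and its linear form are assumptions standing in for the paper's compute-efficient saliency-gate.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def key_only_attention(K, V, w_gate):
    """Hedged sketch of key-only attention.

    Attention weights come from the keys alone (no query-key dot products):
    a learned gate scores each key, a softmax turns the scores into weights,
    and the values are pooled accordingly. Cost is O(n*d) rather than the
    O(n^2 * d) of pairwise query-key attention.
    """
    scores = K @ w_gate            # (n,) one saliency score per token
    weights = softmax(scores)      # (n,)
    return weights @ V             # (d_v,) pooled context, shared by all queries

rng = np.random.default_rng(2)
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 16))
w_gate = rng.normal(size=8)
print(key_only_attention(K, V, w_gate).shape)  # (16,)
```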
arXiv Detail & Related papers (2022-07-01T03:36:49Z) - LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z) - Stochastic tensor space feature theory with applications to robust machine learning [3.6891975755608355]
We develop a Multilevel Orthogonal Subspace (MOS) Karhunen-Loeve feature theory based on tensor spaces.
Our key observation is that separate machine learning classes can reside predominantly in mostly distinct subspaces.
Tests in the blood plasma dataset (Alzheimer's Disease Neuroimaging Initiative) show dramatic increases in accuracy.
arXiv Detail & Related papers (2021-10-04T22:01:01Z) - Unravelling Small Sample Size Problems in the Deep Learning World [69.82853912238173]
We first present a review of deep learning algorithms for small sample size problems in which the algorithms are segregated according to the space in which they operate.
Secondly, we present Dynamic Attention Pooling approach which focuses on extracting global information from the most discriminative sub-part of the feature map.
arXiv Detail & Related papers (2020-08-08T13:35:49Z) - A new nature inspired modularity function adapted for unsupervised learning involving spatially embedded networks: A comparative analysis [0.0]
Unsupervised machine learning methods can be of great help in many traditional engineering disciplines.
We have compared the performance of our newly developed modularity function with some of the well-known modularity functions.
We show that for the class of networks considered in this article, our method produces much better results than the competing methods.
arXiv Detail & Related papers (2020-07-18T04:32:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.