Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
- URL: http://arxiv.org/abs/2402.18508v2
- Date: Fri, 24 May 2024 05:51:52 GMT
- Title: Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
- Authors: Mahdi Karami, Ali Ghodsi
- Abstract summary: This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms.
At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its kernel conditioned on the input sequence.
We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality.
- Score: 4.190836962132713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the rapidly evolving field of deep learning, the demand for models that are both expressive and computationally efficient has never been more critical. This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms without compromising the ability to capture long-range dependencies and in-context learning. At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its kernel conditioned on input sequence using a dedicated conditioning neural network. We design two simple conditioning networks that maintain shift equivariance in our data-dependent convolution operation. The dynamic nature of the proposed convolution kernel grants Orchid high expressivity while maintaining quasilinear scalability for long sequences. We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality. Our experiments demonstrate that this architecture not only outperforms traditional attention-based architectures such as BERT and Vision Transformers with smaller model sizes, but also extends the feasible sequence length beyond the limitations of the dense attention layers. This achievement represents a significant step towards more efficient and scalable deep learning models for sequence modeling.
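The idea of a data-dependent global convolution that stays shift equivariant can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the conditioning step below (a short circular convolution followed by taking the magnitude spectrum) and the names `circular_conv`, `data_dependent_conv`, and `w_local` are assumptions chosen for illustration. The FFT-based product gives the O(L log L) cost that the abstract calls quasilinear.

```python
import numpy as np


def circular_conv(x, k):
    """Circular convolution of x (length L) with kernel k via the FFT, O(L log L)."""
    L = x.shape[-1]
    # rfft zero-pads k to length L when k is shorter than x.
    return np.fft.irfft(np.fft.rfft(x, n=L) * np.fft.rfft(k, n=L), n=L)


def data_dependent_conv(x, w_local):
    """Global convolution whose kernel is computed from the input itself.

    Assumed (hypothetical) conditioning network: one short circular
    convolution with weights w_local, which is shift equivariant.
    """
    L = x.shape[-1]
    g = circular_conv(x, w_local)
    # A circular shift only changes the phase of the spectrum, so the
    # magnitude gives a kernel frequency response that ignores shifts of x.
    kernel_freq = np.abs(np.fft.rfft(g, n=L))
    # Apply the input-conditioned kernel to x in the frequency domain.
    return np.fft.irfft(np.fft.rfft(x, n=L) * kernel_freq, n=L)


rng = np.random.default_rng(0)
x = rng.standard_normal(64)
w_local = rng.standard_normal(3)
y = data_dependent_conv(x, w_local)
y_shifted = data_dependent_conv(np.roll(x, 5), w_local)
```

In this sketch, shifting the input circularly by 5 positions shifts the output by the same amount, so `y_shifted` matches `np.roll(y, 5)` up to floating-point error. Because the kernel depends on `x`, the layer is not linear in its input, unlike a convolution with a fixed kernel.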
Related papers
- Test-time regression: a unifying framework for designing sequence models with associative memory [24.915262407519876]
We show that effective sequence models must be able to perform associative recall.
Our key insight is that memorizing input tokens through an associative memory is equivalent to performing regression at test-time.
We show numerous recent architectures -- including linear attention models, their gated variants, state-space models, online learners, and softmax attention -- emerge naturally as specific approaches to test-time regression.
arXiv Detail & Related papers (2025-01-21T18:32:31Z)
- Multi-Head Self-Attending Neural Tucker Factorization [5.734615417239977]
We introduce a neural network-based tensor factorization approach tailored for learning representations of high-dimensional and incomplete (HDI) tensors.
The proposed MSNTucF model demonstrates superior performance compared to state-of-the-art benchmark models in estimating missing observations.
arXiv Detail & Related papers (2025-01-16T13:04:15Z)
- STAR: Synthesis of Tailored Architectures [61.080157488857516]
We propose a new approach for the synthesis of tailored architectures (STAR).
Our approach combines a novel search space based on the theory of linear input-varying systems, supporting a hierarchical numerical encoding into architecture genomes. STAR genomes are automatically refined and recombined with gradient-free, evolutionary algorithms to optimize for multiple model quality and efficiency metrics.
Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly-optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling.
arXiv Detail & Related papers (2024-11-26T18:42:42Z)
- Adaptable Embeddings Network (AEN) [49.1574468325115]
We introduce Adaptable Embeddings Networks (AEN), a novel dual-encoder architecture using Kernel Density Estimation (KDE).
AEN allows for runtime adaptation of classification criteria without retraining and is non-autoregressive.
The architecture's ability to preprocess and cache condition embeddings makes it ideal for edge computing applications and real-time monitoring systems.
arXiv Detail & Related papers (2024-11-21T02:15:52Z)
- Topological Deep Learning with State-Space Models: A Mamba Approach for Simplicial Complexes [4.787059527893628]
We propose a novel architecture designed to operate with simplicial complexes, utilizing the Mamba state-space model as its backbone.
Our approach generates sequences for the nodes based on the neighboring cells, enabling direct communication between all higher-order structures, regardless of their rank.
arXiv Detail & Related papers (2024-09-18T14:49:25Z)
- GrootVL: Tree Topology is All You Need in State Space Model [66.36757400689281]
GrootVL is a versatile multimodal framework that can be applied to both visual and textual tasks.
Our method significantly outperforms existing structured state space models on image classification, object detection and segmentation.
By fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost.
arXiv Detail & Related papers (2024-06-04T15:09:29Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The computational cost of the self-attention mechanism limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps compensate for the lack of long-range dependency modeling.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Learning From Simplicial Data Based on Random Walks and 1D Convolutions [6.629765271909503]
We propose SCRaWl, a simplicial complex neural network learning architecture based on random walks and fast 1D convolutions.
We empirically evaluate SCRaWl on real-world datasets and show that it outperforms other simplicial neural networks.
arXiv Detail & Related papers (2024-04-04T13:27:22Z)
- Multi-Scale Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition [140.18376685167857]
A simple yet effective multi-scale semantics-guided neural network is proposed for skeleton-based action recognition.
MS-SGN achieves state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.
arXiv Detail & Related papers (2021-11-07T03:50:50Z)
- DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications [0.0]
One of the limitations of deep learning models with sparse features today stems from the predefined nature of their input.
We show that the resulting models perform better and run efficiently at a much larger scale.
arXiv Detail & Related papers (2020-04-17T17:43:51Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.