Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention
- URL: http://arxiv.org/abs/2504.08801v1
- Date: Tue, 08 Apr 2025 22:16:54 GMT
- Title: Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention
- Authors: Andrew Kiruluta, Priscilla Burity, Samantha Williams,
- Abstract summary: Learnable Multi-Scale Wavelet Transformer (LMWT) is a novel architecture that replaces the standard dot-product self-attention.<n>We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework.<n>Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer architectures, underpinned by the self-attention mechanism, have achieved state-of-the-art results across numerous natural language processing (NLP) tasks by effectively modeling long-range dependencies. However, the computational complexity of self-attention, scaling quadratically with input sequence length, presents significant challenges for processing very long sequences or operating under resource constraints. This paper introduces the Learnable Multi-Scale Wavelet Transformer (LMWT), a novel architecture that replaces the standard dot-product self-attention with a learnable multi-scale Haar wavelet transform module. Leveraging the intrinsic multi-resolution properties of wavelets, the LMWT efficiently captures both local details and global context. Crucially, the parameters of the wavelet transform, including scale-specific coefficients, are learned end-to-end during training, allowing the model to adapt its decomposition strategy to the data and task. We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework, supplemented by an architectural diagram. We conduct a comprehensive experimental evaluation on a standard machine translation benchmark (WMT16 En-De), comparing the LMWT against a baseline self-attention transformer using metrics like BLEU score, perplexity, and token accuracy. Furthermore, we analyze the computational complexity, highlighting the linear scaling of our approach, discuss its novelty in the context of related work, and explore the interpretability offered by visualizing the learned Haar coefficients. Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages, positioning it as a promising and novel alternative for efficient sequence modeling.
Related papers
- PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture [46.266960248570086]
This study tackles the quadratic complexity of the self-attention mechanism by introducing a complexity local attention mechanism for effective feature aggregation.
We also introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel.
We show that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between performance and accuracy.
arXiv Detail & Related papers (2024-08-10T10:16:03Z) - Unlocking the Power of Patch: Patch-Based MLP for Long-Term Time Series Forecasting [0.0]
Recent studies have attempted to refine the Transformer architecture to demonstrate its effectiveness in Long-Term Time Series Forecasting tasks.<n>We attribute the effectiveness of these models largely to the adopted Patch mechanism.<n>We propose a novel and simple Patch-based components (PatchMLP) for LTSF tasks.
arXiv Detail & Related papers (2024-05-22T12:12:20Z) - CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z) - Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL)
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z) - Transformers as Statisticians: Provable In-Context Learning with
In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A emphsingle transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z) - Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup to transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.