Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis
- URL: http://arxiv.org/abs/2403.18063v2
- Date: Mon, 3 Jun 2024 18:22:30 GMT
- Title: Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis
- Authors: Badri N. Patro, Suhas Ranganath, Vinay P. Namboodiri, Vijay S. Agneeswaran
- Abstract summary: Heracles is a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module.
Heracles-C-small achieves state-of-the-art performance on the ImageNet dataset with 84.5% top-1 accuracy.
Heracles excels in transfer learning tasks on datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars.
- Score: 23.511807886483087
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformers have revolutionized image modeling tasks with adaptations like DeiT, Swin, SVT, Biformer, STVit, and FDVIT. However, these models often face challenges with inductive bias and quadratic complexity, making them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative for handling high-resolution images in computer vision tasks. These SSMs encounter two major issues. First, they become unstable when scaled to large network sizes. Second, although they efficiently capture global information in images, they inherently struggle with handling local information. To address these challenges, we introduce Heracles, a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module. Heracles leverages a Hartley kernel-based state space model for global image information, a localized convolutional network for local details, and attention mechanisms in deeper layers for token interactions. Our extensive experiments demonstrate that Heracles-C-small achieves state-of-the-art performance on the ImageNet dataset with 84.5% top-1 accuracy. Heracles-C-Large and Heracles-C-Huge further improve accuracy to 85.9% and 86.4%, respectively. Additionally, Heracles excels in transfer learning tasks on datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars, and in instance segmentation on the MSCOCO dataset. Heracles also proves its versatility by achieving state-of-the-art results on seven time-series datasets, showcasing its ability to generalize across domains with spectral data, capturing both local and global information. The project page is available at https://github.com/badripatro/heracles
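Below is a minimal PyTorch sketch of the block structure the abstract describes: a Hartley-kernel spectral mixer for global information and a depthwise-convolutional branch for local detail, fused residually (the abstract's deeper-layer attention module is noted but omitted). All module names, the learned spectral gate, and the additive fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def hartley2d(x: torch.Tensor) -> torch.Tensor:
    """Discrete Hartley transform over the two spatial dims.

    DHT(x) = Re(FFT(x)) - Im(FFT(x)); with ortho normalization it is
    real-valued and self-inverse, convenient for spectral token mixing.
    """
    f = torch.fft.fft2(x, norm="ortho")
    return f.real - f.imag


class GlobalHartleyMixer(nn.Module):
    """Global branch: reweight channels in the Hartley (spectral) domain."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(dim, 1, 1))  # learned spectral gate

    def forward(self, x):  # x: (B, C, H, W)
        h = hartley2d(x) * self.gate  # per-channel spectral reweighting
        return hartley2d(h)           # DHT is its own inverse (ortho norm)


class LocalConvBranch(nn.Module):
    """Local branch: depthwise conv captures neighborhood detail."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)

    def forward(self, x):
        return self.dw(x)


class HeraclesStyleBlock(nn.Module):
    """Global + local branches fused by addition (an assumption); per the
    abstract, deeper layers would use attention for token interaction."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)
        self.global_branch = GlobalHartleyMixer(dim)
        self.local_branch = LocalConvBranch(dim)

    def forward(self, x):
        y = self.norm(x)
        return x + self.global_branch(y) + self.local_branch(y)


x = torch.randn(2, 64, 56, 56)
print(HeraclesStyleBlock(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```

With ortho normalization the Hartley transform is its own inverse, so the global branch filters in the spectral domain and maps straight back at O(N log N) cost, which is what makes this kind of mixer attractive for high-resolution inputs.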
Related papers
- Microscopic-Mamba: Revealing the Secrets of Microscopic Images with Just 4M Parameters [12.182070604073585]
CNNs struggle with modeling long-range dependencies, limiting their ability to fully utilize semantic information in images.
Transformers are hampered by the quadratic complexity of self-attention.
We propose a model based on the Mamba architecture: Microscopic-Mamba.
arXiv Detail & Related papers (2024-09-12T10:01:33Z)
- LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation [0.9831489366502301]
Mamba, a State Space Model, has recently shown performance competitive with Convolutional Neural Networks (CNNs) and Transformers.
Various attempts have been made to adapt Mamba to computer vision tasks, including medical image segmentation (MIS).
arXiv Detail & Related papers (2024-08-26T17:02:25Z)
- LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
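As a rough illustration of the "local attention plus linear-time global context" pattern named in the LaMamba-Diff entry above, this toy sketch pairs windowed self-attention with a gated linear recurrence. The recurrence is a simplified stand-in for a Mamba-style scan; the window size, sigmoid gating, and additive fusion are all assumptions rather than the paper's block design.

```python
import torch
import torch.nn as nn


def gated_scan(x, a, b):
    """h_t = a_t * h_{t-1} + b_t * x_t, computed in O(N). Shapes: (B, N, D)."""
    h = torch.zeros_like(x[:, 0])
    out = []
    for t in range(x.shape[1]):
        h = a[:, t] * h + b[:, t] * x[:, t]
        out.append(h)
    return torch.stack(out, dim=1)


class LocalGlobalToyBlock(nn.Module):
    def __init__(self, dim: int, window: int = 16, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gates = nn.Linear(dim, 2 * dim)  # produces a_t and b_t per token

    def forward(self, x):  # x: (B, N, D), N divisible by window
        B, N, D = x.shape
        # Local: full attention inside non-overlapping windows (linear in N).
        w = x.reshape(B * N // self.window, self.window, D)
        local, _ = self.attn(w, w, w)
        local = local.reshape(B, N, D)
        # Global: gated linear recurrence over the whole sequence.
        a, b = self.gates(x).sigmoid().chunk(2, dim=-1)
        return x + local + gated_scan(x, a, b)


x = torch.randn(2, 64, 32)
print(LocalGlobalToyBlock(32)(x).shape)  # torch.Size([2, 64, 32])
```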
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models [44.437693135170576]
We propose a new framework: LMM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME).
We extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks.
The proposed method achieves leading performance across various benchmarks with only 2 million training samples.
arXiv Detail & Related papers (2024-06-12T17:59:49Z)
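For the mixture-of-adapters idea in the SliME entry above, here is a hedged sketch: several small bottleneck adapters process the same global-view features, and a learned softmax gate mixes their outputs per token, so different adapters can specialize in different tasks. The expert count, adapter width, and token-level gating are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AdapterMixture(nn.Module):
    def __init__(self, dim: int, n_experts: int = 4, bottleneck: int = 64):
        super().__init__()
        # Each expert is a small bottleneck adapter over the same features.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)  # per-token mixing weights

    def forward(self, x):  # x: (B, N, D) global-view features
        weights = self.gate(x).softmax(dim=-1)                 # (B, N, E)
        outs = torch.stack([e(x) for e in self.experts], -1)   # (B, N, D, E)
        return x + (outs * weights.unsqueeze(-2)).sum(-1)      # residual mix


x = torch.randn(2, 196, 512)
print(AdapterMixture(512)(x).shape)  # torch.Size([2, 196, 512])
```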
- RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large very-high-resolution (VHR) remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
- xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers that aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
arXiv Detail & Related papers (2024-03-04T10:29:58Z)
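For the nested tokenization idea in the xT entry above, a toy sketch: the large image is first cut into regions, each region is patch-tokenized independently, and a pooled summary per region yields a short global sequence that a second-stage encoder could attend over. The region/patch sizes and mean-pooled summaries are assumptions, not xT's implementation.

```python
import torch
import torch.nn as nn


class NestedTokenizer(nn.Module):
    def __init__(self, dim: int = 256, patch: int = 16, region: int = 256):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.region = region

    def forward(self, img):  # img: (B, 3, H, W), H and W divisible by region
        B, _, H, W = img.shape
        r = self.region
        # Level 1: split the large image into independent regions.
        regions = img.unfold(2, r, r).unfold(3, r, r)    # (B, 3, H/r, W/r, r, r)
        regions = regions.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, r, r)
        # Level 2: tokenize each region into patches, then pool to one
        # summary token per region; a global encoder would attend over these.
        tokens = self.patch_embed(regions).flatten(2)    # (B*R, dim, tokens)
        summaries = tokens.mean(dim=-1)                  # (B*R, dim)
        return summaries.reshape(B, -1, summaries.shape[-1])  # (B, R, dim)


img = torch.randn(1, 3, 1024, 1024)
print(NestedTokenizer()(img).shape)  # torch.Size([1, 16, 256])
```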
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- DLGSANet: Lightweight Dynamic Local and Global Self-Attention Networks for Image Super-Resolution [83.47467223117361]
We propose an effective lightweight dynamic local and global self-attention network (DLGSANet) to solve image super-resolution.
Motivated by the network designs of Transformers, we develop a simple yet effective multi-head dynamic local self-attention (MHDLSA) module to extract local features efficiently.
Since not all query-key similarities are useful for reconstruction, we further develop a sparse global self-attention (SparseGSA) module to select the most useful similarity values.
arXiv Detail & Related papers (2023-01-05T12:06:47Z)
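The DLGSANet summary above says SparseGSA selects only the most useful similarity values. A common way to realize that idea is top-k masking of the attention scores, sketched below; the value of k and the single-head layout are assumptions, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F


def sparse_global_attention(q, k, v, topk: int = 8):
    """q, k, v: (B, N, D). Keep only the top-k scores per query row."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, N, N)
    top_vals, top_idx = scores.topk(topk, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, top_idx, top_vals)   # non-top-k entries stay -inf
    attn = F.softmax(masked, dim=-1)         # -inf scores get zero weight
    return attn @ v


q = k = v = torch.randn(2, 196, 64)
print(sparse_global_attention(q, k, v).shape)  # torch.Size([2, 196, 64])
```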
- Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: a Global Enhancement Module (GEM) and a Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z)