Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking
- URL: http://arxiv.org/abs/2603.01706v1
- Date: Mon, 02 Mar 2026 10:30:54 GMT
- Title: Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking
- Authors: Tianqi Shen, Huakao Lin, Ning An
- Abstract summary: A Multilayer Perceptron (MLP)-based fusion module enables pixel-level interaction with minimal structural overhead. Differentiable neural architecture search (DNAS) decouples channel-width optimization from other architectural choices. The tracker ranks among the top performers on four general-purpose and three aerial benchmarks.
- Score: 3.7727834708902868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
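The abstract names two ideas that can be sketched concretely: an MLP block whose channel-mixing cost grows quadratically with width, and a DNAS-style softmax relaxation that turns the discrete choice of channel width into a differentiable expected cost. The sketch below is illustrative only; the paper's actual module and search space are not given in the abstract, and all names, shapes, and candidate widths here are assumptions:

```python
import numpy as np

def mlp_fusion(template_feat, search_feat, w_token, w_channel):
    """Illustrative MLP fusion block: concatenate template and search pixels,
    then mix across tokens (pixel-level interaction) and across channels."""
    x = np.concatenate([template_feat, search_feat], axis=0)  # (N, C)
    x = x + w_token @ x             # token mixing: every pixel interacts with every pixel
    x = x + np.tanh(x @ w_channel)  # channel mixing: cost is O(N * C^2)
    return x

# DNAS-style relaxation over candidate channel widths: a softmax over
# architecture logits gives a differentiable surrogate for the quadratic cost.
widths = np.array([64.0, 128.0, 256.0])  # assumed candidate widths
alpha = np.array([0.2, 1.0, -0.5])       # learnable architecture logits
weights = np.exp(alpha) / np.exp(alpha).sum()
cost_per_pixel = 2.0 * widths ** 2       # channel-MLP FLOPs scale quadratically with width
expected_cost = float(weights @ cost_per_pixel)  # differentiable penalty for the search loss
```

Minimizing a loss that includes `expected_cost` pushes the softmax weights toward narrower widths, which is how a differentiable search can trade accuracy against the quadratic channel cost.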
Related papers
- MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation [92.57609195819647]
MuSASplat is a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splats models.<n>Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures with only a small fraction of training parameters.
arXiv Detail & Related papers (2025-12-08T04:56:46Z)
- Rethinking Vision Transformer Depth via Structural Reparameterization [16.12815682992294]
We propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K.
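The branch-consolidation idea summarized above has a simple linear-algebra core: parallel linear branches that share the same input can be merged into a single weight matrix after training. A minimal sketch with toy shapes, making no claim about the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal((5, 16))    # a batch of token features
w1 = rng.standard_normal((16, 16))  # branch 1 weight
w2 = rng.standard_normal((16, 16))  # branch 2 weight

# Multi-branch forward pass during training:
y_branches = x @ w1 + x @ w2

# After training, consolidate both branches into one single-path layer:
w_merged = w1 + w2
y_single = x @ w_merged

# The streamlined model is exactly equivalent (up to float rounding):
assert np.allclose(y_branches, y_single)
```

Real transformer blocks add nonlinearities and normalization between layers, so practical reparameterization schemes must restrict where the merge is applied; this sketch shows only the linear case.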
arXiv Detail & Related papers (2025-11-24T21:28:55Z)
- MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception. High computational cost limits adoption on resource-constrained platforms. We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
- Exploring Non-Local Spatial-Angular Correlations with a Hybrid Mamba-Transformer Framework for Light Field Super-Resolution [68.54692184478462]
Mamba-based methods have shown great potential in optimizing both computational cost and performance of light field image super-resolution. We propose a Subspace Simple Scanning (Sub-SS) strategy, based on which we design the Subspace Simple Mamba Block (SSMB) to achieve more efficient and precise feature extraction. We also propose a dual-stage modeling strategy to address the limitation of state space in preserving spatial-angular and disparity information.
arXiv Detail & Related papers (2025-09-05T05:50:38Z)
- MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection [4.757840725810513]
YOLO-series models have set strong benchmarks by balancing speed and accuracy. Transformers have high computational complexity because of their self-attention mechanism. We propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency.
arXiv Detail & Related papers (2025-06-04T07:46:24Z)
- An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z)
- CFMD: Dynamic Cross-layer Feature Fusion for Salient Object Detection [7.262250906929891]
Cross-layer feature pyramid networks (CFPNs) have achieved notable progress in multi-scale feature fusion and boundary detail preservation for salient object detection. To address these challenges, we propose CFMD, a novel cross-layer feature pyramid network that introduces two key innovations. First, we design a context-aware feature aggregation module (CFLMA), which incorporates the state-of-the-art Mamba architecture to construct a dynamic weight distribution mechanism. Second, we introduce an adaptive dynamic upsampling unit (CFLMD) that preserves spatial details during resolution recovery.
arXiv Detail & Related papers (2025-04-02T03:22:36Z)
- PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture [46.266960248570086]
This study tackles the quadratic complexity of the self-attention mechanism by introducing a linear-complexity local attention mechanism for effective feature aggregation.
We also introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel.
We show that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between accuracy and efficiency.
arXiv Detail & Related papers (2024-08-10T10:16:03Z)
- Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search [51.89707241449435]
In this paper, we address the challenge of integrating multi-head self-attention into high-resolution representation CNNs efficiently. We develop a multi-target multi-branch supernet method, which fully utilizes the advantages of high-resolution features. We present a series of models via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method that searches for the best hybrid combination of light-weight convolution layers and memory-efficient self-attention layers.
arXiv Detail & Related papers (2024-03-15T15:47:54Z)
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many computer vision tasks. We show that the dense connections can be replaced with a sparse block diagonal structure that supports larger expansion ratios. We also propose the use of a lightweight, parameter-free, channel covariance attention mechanism as a parallel branch during training.
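The block-diagonal replacement for dense channel mixing summarized above can be illustrated in a few lines: splitting C channels into g groups that are mixed independently cuts mixer parameters by a factor of g. The function name and shapes below are illustrative assumptions, not SCHEME's actual implementation:

```python
import numpy as np

def block_diagonal_mix(x, blocks):
    """Channel mixing with a block-diagonal weight: each channel group is
    mixed only within its own block, emulating a sparse channel mixer."""
    groups = np.split(x, len(blocks), axis=-1)  # split channels into equal groups
    return np.concatenate([g @ b for g, b in zip(groups, blocks)], axis=-1)

channels, num_groups = 16, 4
dense_params = channels * channels                         # 256 weights in a dense mixer
block_params = num_groups * (channels // num_groups) ** 2  # 64 weights: reduced by factor g
```

The parameter savings are what allow larger expansion ratios at the same budget, which is the trade-off the abstract points to.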
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- MAXIM: Multi-Axis MLP for Image Processing [19.192826213493838]
We present a multi-axis based architecture, called MAXIM, that can serve as an efficient general-purpose vision backbone for image processing tasks.
MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs.
Results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks.
arXiv Detail & Related papers (2022-01-09T09:59:32Z)
- AutoPose: Searching Multi-Scale Branch Aggregation for Pose Estimation [96.29533512606078]
We present AutoPose, a novel neural architecture search (NAS) framework.
It is capable of automatically discovering multiple parallel branches of cross-scale connections towards accurate and high-resolution 2D human pose estimation.
arXiv Detail & Related papers (2020-08-16T22:27:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.