TwistNet-2D: Learning Second-Order Channel Interactions via Spiral Twisting for Texture Recognition
- URL: http://arxiv.org/abs/2602.07262v2
- Date: Tue, 10 Feb 2026 23:43:51 GMT
- Title: TwistNet-2D: Learning Second-Order Channel Interactions via Spiral Twisting for Texture Recognition
- Authors: Junbo Jacob Lian, Feng Xiong, Yujun Sun, Kaichen Ouyang, Zong Ke, Mingyang Yu, Shengwei Fu, Zhong Rui, Zhang Yujun, Huiling Chen
- Abstract summary: We introduce TwistNet-2D, a lightweight module that computes \emph{local} pairwise channel products under directional spatial displacement. The core component, Spiral-Twisted Channel Interaction (STCI), shifts one feature map along a prescribed direction before element-wise channel multiplication.
- Score: 8.911086692137593
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Second-order feature statistics are central to texture recognition, yet current methods face a fundamental tension: bilinear pooling and Gram matrices capture global channel correlations but collapse spatial structure, while self-attention models spatial context through weighted aggregation rather than explicit pairwise feature interactions. We introduce TwistNet-2D, a lightweight module that computes \emph{local} pairwise channel products under directional spatial displacement, jointly encoding where features co-occur and how they interact. The core component, Spiral-Twisted Channel Interaction (STCI), shifts one feature map along a prescribed direction before element-wise channel multiplication, thereby capturing the cross-position co-occurrence patterns characteristic of structured and periodic textures. Aggregating four directional heads with learned channel reweighting and injecting the result through a sigmoid-gated residual path, TwistNet-2D incurs only 3.5% additional parameters and 2% additional FLOPs over ResNet-18, yet consistently surpasses both parameter-matched and substantially larger baselines -- including ConvNeXt, Swin Transformer, and hybrid CNN--Transformer architectures -- across four texture and fine-grained recognition benchmarks.
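The STCI mechanism described in the abstract can be sketched in a few lines of NumPy. This is a hedged reconstruction from the abstract alone, not the authors' code: the one-pixel circular shift, the four fixed directions, and the `channel_weights` vector standing in for the learned reweighting are all assumptions for illustration.

```python
import numpy as np

def stci_head(x, shift, axis):
    """One directional STCI head (sketch): shift a copy of the feature map
    along a spatial direction, then take the element-wise product with the
    original, so each position encodes a cross-position channel co-occurrence."""
    # x has shape (C, H, W); np.roll is a stand-in for the paper's displacement.
    shifted = np.roll(x, shift, axis=axis)
    return x * shifted

def twistnet_block(x, channel_weights=None):
    """Aggregate four directional heads with a per-channel reweighting
    (here a plain vector in place of learned parameters), then inject the
    result through a sigmoid-gated residual path."""
    C, H, W = x.shape
    if channel_weights is None:
        channel_weights = np.ones(C)
    # Four prescribed directions with a 1-pixel displacement: up, down, left, right.
    heads = [
        stci_head(x, -1, axis=1),
        stci_head(x, +1, axis=1),
        stci_head(x, -1, axis=2),
        stci_head(x, +1, axis=2),
    ]
    agg = sum(heads) / 4.0
    agg = agg * channel_weights[:, None, None]  # channel reweighting
    gate = 1.0 / (1.0 + np.exp(-agg))           # sigmoid gate
    return x + gate * agg                       # gated residual injection

x = np.random.randn(8, 16, 16).astype(np.float32)
y = twistnet_block(x)
print(y.shape)  # (8, 16, 16)
```

Note that the block preserves the input shape, which is consistent with the abstract's claim that the module drops into a ResNet-18 backbone with small parameter and FLOP overhead.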
Related papers
- DisentangleFormer: Spatial-Channel Decoupling for Multi-Channel Vision [10.378378296066305]
Vision Transformers face a fundamental limitation: standard self-attention jointly processes spatial and channel dimensions. We propose DisentangleFormer, an architecture that achieves robust multi-channel vision representation through principled spatial-channel decoupling. Our design integrates three core components: (1) Parallel Disentanglement: independently processes spatial-token and channel-token streams, enabling decorrelated feature learning across spatial and spectral dimensions; (2) Squeezed Token Enhancer: an adaptive calibration module that dynamically fuses spatial and channel streams; and (3) Multi-Scale FFN: complementing global attention with multi-scale local context.
arXiv Detail & Related papers (2025-12-03T23:03:56Z) - Region-Point Joint Representation for Effective Trajectory Similarity Learning [25.664203648334563]
RePo is a novel method that encodes region-wise and point-wise features to capture both spatial context and fine-grained moving patterns. Experiment results show that RePo achieves an average accuracy improvement of 22.2% over SOTA baselines.
arXiv Detail & Related papers (2025-11-17T08:28:18Z) - TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - TransUPR: A Transformer-based Uncertain Point Refiner for LiDAR Point
Cloud Semantic Segmentation [6.587305905804226]
We propose a transformer-based uncertain point refiner, i.e., TransUPR, to refine selected uncertain points in a learnable manner.
Our TransUPR achieves state-of-the-art performance, i.e., 68.2% mean Intersection over Union (mIoU) on the Semantic KITTI benchmark.
arXiv Detail & Related papers (2023-02-16T21:38:36Z) - Bridging the Gap Between Vision Transformers and Convolutional Neural
Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves a series of state-of-the-art performance with a lightweight model, 85.68% on CIFAR-100 with 22.8M parameters, 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - DaViT: Dual Attention Vision Transformers [94.62855697081079]
We introduce Dual Attention Vision Transformers (DaViT).
DaViT is a vision transformer architecture that is able to capture global context while maintaining computational efficiency.
We show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations.
arXiv Detail & Related papers (2022-04-07T17:59:32Z) - Gate-Shift-Fuse for Video Action Recognition [43.8525418821458]
Gate-Shift-Fuse (GSF) is a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner.
GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead.
We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
arXiv Detail & Related papers (2022-03-16T19:19:04Z) - PlueckerNet: Learn to Register 3D Line Reconstructions [57.20244406275875]
This paper proposes a neural network based method to solve the problem of aligning two partially-overlapped 3D line reconstructions in Euclidean space.
Experiments on both indoor and outdoor datasets show that the registration (rotation and translation) precision of our method outperforms baselines significantly.
arXiv Detail & Related papers (2020-12-02T11:31:56Z) - Volumetric Transformer Networks [88.85542905676712]
We introduce a learnable module, the volumetric transformer network (VTN).
VTN predicts channel-wise warping fields so as to reconfigure intermediate CNN features both spatially and channel-wise.
Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval.
arXiv Detail & Related papers (2020-07-18T14:00:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.