CLUENet: Cluster Attention Makes Neural Networks Have Eyes
- URL: http://arxiv.org/abs/2512.06345v1
- Date: Sat, 06 Dec 2025 08:26:36 GMT
- Title: CLUENet: Cluster Attention Makes Neural Networks Have Eyes
- Authors: Xiangshuai Song, Jun-Jie Huang, Tianrui Liu, Ke Liang, Chang Tang,
- Abstract summary: Clustering paradigms offer promising interpretability and flexible semantic modeling. We propose CLUster attEntion Network (CLUENet), a transparent deep architecture for visual semantic understanding.
- Score: 25.43808812298579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success of convolution- and attention-based models in vision tasks, their rigid receptive fields and complex architectures limit their ability to model irregular spatial patterns and hinder interpretability, posing challenges for tasks that require high model transparency. Clustering paradigms offer promising interpretability and flexible semantic modeling, but suffer from limited accuracy, low efficiency, and vanishing gradients during training. To address these issues, we propose the CLUster attEntion Network (CLUENet), a transparent deep architecture for visual semantic understanding. We propose three key innovations: (i) Global Soft Aggregation and Hard Assignment with a Temperature-Scaled Cosine Attention and gated residual connections for enhanced local modeling, (ii) inter-block Hard and Shared Feature Dispatching, and (iii) an improved cluster pooling strategy. These enhancements significantly improve both classification performance and visual interpretability. Experiments on CIFAR-100 and Mini-ImageNet demonstrate that CLUENet outperforms existing clustering methods and mainstream visual models, offering a compelling balance of accuracy, efficiency, and transparency.
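The abstract names a Temperature-Scaled Cosine Attention with gated residual connections but gives no equations, so the following PyTorch sketch shows one plausible reading: query-key similarity is a cosine similarity divided by a learnable temperature, and the attention output is blended into the residual stream through a learned gate. All module and parameter names here are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaledCosineAttention(nn.Module):
    """Hypothetical sketch: cosine-similarity attention with a learnable
    temperature and a gated residual connection (names are illustrative)."""

    def __init__(self, dim: int, init_temp: float = 0.07):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learnable temperature; log-parameterized to stay positive.
        self.log_temp = nn.Parameter(torch.tensor(init_temp).log())
        # Gate deciding how much attention output enters the residual stream.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q = F.normalize(self.to_q(x), dim=-1)   # unit-norm queries
        k = F.normalize(self.to_k(x), dim=-1)   # unit-norm keys
        v = self.to_v(x)
        # Cosine similarities scaled by the learned temperature.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.log_temp.exp(), dim=-1)
        out = attn @ v
        g = self.gate(x)                         # per-channel gate in (0, 1)
        return x + g * out                       # gated residual connection

x = torch.randn(2, 16, 64)
print(TemperatureScaledCosineAttention(64)(x).shape)  # torch.Size([2, 16, 64])
```

Normalizing queries and keys bounds the pre-softmax scores to [-1, 1], so the learned temperature, rather than the feature dimension, controls how peaked the attention distribution becomes.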
Related papers
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
- Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition [6.580655899524989]
Convolutional neural networks (CNNs) have demonstrated strong performance in visual recognition tasks. We propose a novel framework that integrates graph-based reasoning into CNNs to enhance both structural awareness and feature representation. The proposed HGFE module is lightweight, end-to-end trainable, and can be seamlessly integrated into standard CNN backbone networks.
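The summary only says that graph-based reasoning is injected into a CNN as a lightweight plug-in module. A generic way to realize that idea is to treat feature-map positions as graph nodes and propagate features over a similarity graph; the sketch below shows that generic pattern, not the paper's actual HGFE design.

```python
import torch
import torch.nn as nn

class GraphFeatureEnhancement(nn.Module):
    """Illustrative plug-in module (not the paper's HGFE): treats each
    spatial location of a CNN feature map as a graph node and mixes
    features over a similarity graph built from the features themselves."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        nodes = x.flatten(2).transpose(1, 2)          # (b, h*w, c)
        # Dense adjacency from pairwise feature similarity.
        adj = torch.softmax(nodes @ nodes.transpose(1, 2) / c ** 0.5, dim=-1)
        mixed = (adj @ nodes).transpose(1, 2).reshape(b, c, h, w)
        return x + self.proj(mixed)                   # residual enhancement

feat = torch.randn(1, 32, 8, 8)                       # any CNN feature map
print(GraphFeatureEnhancement(32)(feat).shape)        # torch.Size([1, 32, 8, 8])
```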
arXiv Detail & Related papers (2025-08-15T14:19:50Z)
- LeMoRe: Learn More Details for Lightweight Semantic Segmentation [48.81126061219231]
We introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies.
arXiv Detail & Related papers (2025-05-29T04:55:10Z)
- LSNet: See Large, Focus Small [67.05569159984691]
We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation. LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks.
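A minimal way to express the large-kernel-perception plus small-kernel-aggregation pairing is with two standard depthwise convolutions, as sketched below. This is an illustration of the idea under assumed kernel sizes, not LSNet's exact operator.

```python
import torch
import torch.nn as nn

class LSConvSketch(nn.Module):
    """Hedged sketch of a large-small convolution pairing: a large
    depthwise kernel gathers wide context, a small kernel aggregates
    locally. An illustration of the idea, not LSNet's published operator."""

    def __init__(self, channels: int, large_k: int = 7, small_k: int = 3):
        super().__init__()
        self.large = nn.Conv2d(channels, channels, large_k,
                               padding=large_k // 2, groups=channels)
        self.small = nn.Conv2d(channels, channels, small_k,
                               padding=small_k // 2, groups=channels)
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        context = self.large(x)          # "see large": wide receptive field
        local = self.small(context)      # "focus small": local aggregation
        return self.mix(local) + x       # pointwise channel mix, residual

print(LSConvSketch(16)(torch.randn(1, 16, 14, 14)).shape)
```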
arXiv Detail & Related papers (2025-03-29T16:00:54Z)
- Enhancing Interpretability Through Loss-Defined Classification Objective in Structured Latent Spaces [5.2542280870644715]
We introduce Latent Boost, a novel approach that integrates advanced distance metric learning into supervised classification tasks. Latent Boost improves classification interpretability, as demonstrated by higher Silhouette scores, while accelerating training convergence.
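Integrating a distance-metric term into a supervised objective, as Latent Boost is summarized to do, is commonly implemented by adding a metric loss on the latent features to the cross-entropy loss. The snippet below shows that generic combination; the triplet loss and the weighting are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn

# Generic "classification + metric learning" objective (an assumption,
# not Latent Boost's exact loss): cross-entropy on logits plus a triplet
# margin loss that structures the latent space by class.
ce_loss = nn.CrossEntropyLoss()
metric_loss = nn.TripletMarginLoss(margin=1.0)

def combined_loss(logits, latents_a, latents_p, latents_n,
                  labels, weight: float = 0.5):
    """logits: classifier outputs; (anchor, positive, negative) latent
    triplets pull same-class embeddings together, others apart."""
    return ce_loss(logits, labels) + weight * metric_loss(
        latents_a, latents_p, latents_n)

logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
a, p, n = (torch.randn(8, 64) for _ in range(3))
print(combined_loss(logits, a, p, n, labels))
```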
arXiv Detail & Related papers (2024-12-11T16:25:17Z)
- Point Cloud Understanding via Attention-Driven Contrastive Learning [64.65145700121442]
Transformer-based models have advanced point cloud understanding by leveraging self-attention mechanisms.
PointACL is an attention-driven contrastive learning framework designed to address these limitations.
Our method employs an attention-driven dynamic masking strategy that guides the model to focus on under-attended regions.
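One reading of "attention-driven dynamic masking" is to score tokens by how much attention they receive and mask the least-attended ones, steering learning toward those regions. The sketch below implements that reading on generic token features; it is an interpretation, not PointACL's published procedure.

```python
import torch

def mask_under_attended(tokens: torch.Tensor, attn: torch.Tensor,
                        mask_ratio: float = 0.3) -> torch.Tensor:
    """Hypothetical attention-driven masking: tokens (b, n, d) and an
    attention map attn (b, n, n). Tokens receiving the least total
    attention are zeroed out."""
    received = attn.sum(dim=1)                    # (b, n) attention received
    n_mask = int(tokens.shape[1] * mask_ratio)
    idx = received.argsort(dim=1)[:, :n_mask]     # least-attended indices
    masked = tokens.clone()
    masked.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[2]), 0.0)
    return masked

tokens = torch.randn(2, 100, 32)
attn = torch.softmax(torch.randn(2, 100, 100), dim=-1)
print(mask_under_attended(tokens, attn).shape)    # torch.Size([2, 100, 32])
```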
arXiv Detail & Related papers (2024-11-22T05:41:00Z)
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Interpreting and Improving Attention From the Perspective of Large Kernel Convolution [51.06461246235176]
We introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large-kernel convolution. LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings.
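Reinterpreting attention as a single large-kernel convolution suggests, in its simplest form, replacing the token-mixing step with one big depthwise convolution over the spatial grid. The sketch below shows that substitution; the kernel size and layout are assumptions, not LKCA's exact formulation.

```python
import torch
import torch.nn as nn

class LargeKernelAttentionSketch(nn.Module):
    """Simplest reading of 'attention as one large-kernel convolution'
    (an illustration, not LKCA's exact design): token mixing is done by
    a single large depthwise conv instead of QKV attention."""

    def __init__(self, channels: int, kernel_size: int = 13):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size,
                             padding=kernel_size // 2, groups=channels)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(self.mix(x))   # convolutional token mixing

print(LargeKernelAttentionSketch(48)(torch.randn(1, 48, 14, 14)).shape)
```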
arXiv Detail & Related papers (2024-01-11T08:40:35Z)
- Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering [33.05984601411495]
We propose a new perspective on image clustering, the pretrain-based Grid Jigsaw Representation (pGJR). Inspired by human jigsaw puzzle processing, we modify traditional jigsaw learning to gain a more sequential and incremental understanding of image structure. Our experiments demonstrate that using the pretrained model as a feature extractor can accelerate the convergence of clustering.
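The claim that a pretrained feature extractor accelerates clustering convergence reduces, in its simplest form, to clustering frozen pretrained embeddings instead of raw pixels. The snippet below shows that baseline pattern with a torchvision ResNet standing in for the extractor; it omits the jigsaw component entirely and is not pGJR itself.

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights
from sklearn.cluster import KMeans

# Baseline pattern behind the summary's claim (not pGJR itself): cluster
# frozen pretrained features rather than raw pixels. ResNet-18 stands in
# for the pretrained extractor; the paper's jigsaw stage is omitted.
backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # drop the classifier head
backbone.eval()

images = torch.randn(64, 3, 224, 224)      # stand-in image batch
with torch.no_grad():
    feats = backbone(images)               # (64, 512) frozen features

labels = KMeans(n_clusters=10, n_init=10).fit_predict(feats.numpy())
print(labels.shape)                        # (64,)
```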
arXiv Detail & Related papers (2023-10-27T03:07:05Z)
- A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
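The essential weight-sharing pattern behind DIA, as summarized, is that a single self-attention module instance is reused at every depth instead of each layer owning its own. The sketch below shows that pattern only; the paper's LSTM coupling and exact wiring are omitted as simplifying assumptions.

```python
import torch
import torch.nn as nn

class SharedAttentionNet(nn.Module):
    """Sketch of the DIA-style weight-sharing idea (simplified): one
    attention module instance is reused by every layer. The paper's
    long short-term memory coupling is omitted here."""

    def __init__(self, dim: int, depth: int = 4):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(dim, num_heads=4,
                                                 batch_first=True)
        self.ffns = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for ffn in self.ffns:
            # Every layer calls the same shared attention module.
            attn_out, _ = self.shared_attn(x, x, x)
            x = x + torch.relu(ffn(attn_out))
        return x

print(SharedAttentionNet(64)(torch.randn(2, 10, 64)).shape)
```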
arXiv Detail & Related papers (2022-10-27T13:24:08Z)