OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
- URL: http://arxiv.org/abs/2502.20087v1
- Date: Thu, 27 Feb 2025 13:45:15 GMT
- Title: OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
- Authors: Meng Lou, Yizhou Yu
- Abstract summary: We propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers. To fully unleash the power of top-down context guidance, we further propose a novel Context-Mixing Dynamic Convolution (ContMix). Our OverLoCK exhibits notable performance improvement over existing methods.
- Score: 50.42092879252807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the human vision system, top-down attention plays a crucial role in perception, wherein the brain initially performs an overall but rough scene analysis to extract salient cues (i.e., overview first), followed by a finer-grained examination to make more accurate judgments (i.e., look closely next). However, recent efforts in ConvNet designs primarily focused on increasing kernel size to obtain a larger receptive field without considering this crucial biomimetic mechanism to further improve performance. To this end, we propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers by providing dynamic top-down context guidance at both feature and kernel weight levels. To fully unleash the power of top-down context guidance, we further propose a novel Context-Mixing Dynamic Convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases. These properties are absent in previous convolutions. With the support from both DDS and ContMix, our OverLoCK exhibits notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while only using around one-third of the FLOPs/parameters. On object detection with Cascade Mask R-CNN, our OverLoCK-S surpasses MogaNet-B by a significant 1% in AP^b. On semantic segmentation with UperNet, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.
Related papers
- PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection [65.84604846389624]
We propose PointOBB-v3, a stronger single point-supervised OOD framework.
It generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm.
Our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods.
arXiv Detail & Related papers (2025-01-23T18:18:15Z) - Revisiting the Integration of Convolution and Attention for Vision Backbone [59.50256661158862]
Convolutions and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones.
We propose in this work to use MHSAs and Convs in parallel at different granularity levels instead.
We empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few semantic slots.
arXiv Detail & Related papers (2024-11-21T18:59:08Z) - SODAWideNet++: Combining Attention and Convolutions for Salient Object Detection [3.2586315449885106]
We propose a novel encoder-decoder-style neural network called SODAWideNet++ designed explicitly for Salient Object Detection.
Inspired by the vision transformer's ability to attain a global receptive field from the initial stages, we introduce the Attention Guided Long Range Feature Extraction (AGLRFE) module.
In contrast to the current paradigm of ImageNet pre-training, we modify 118K annotated images from the COCO semantic segmentation dataset by binarizing the annotations to pre-train the proposed model end-to-end.
arXiv Detail & Related papers (2024-08-29T15:51:06Z) - TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic
Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z) - TOPIQ: A Top-down Approach from Semantics to Distortions for Image
Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower level features.
arXiv Detail & Related papers (2023-08-06T09:08:37Z) - Small but Mighty: Enhancing 3D Point Clouds Semantic Segmentation with
U-Next Framework [7.9395601503353825]
We propose U-Next, a small but mighty framework designed for point cloud semantic segmentation.
We build our U-Next by stacking multiple U-Net $L1$ codecs in a nested and densely arranged manner to minimize the semantic gap.
Extensive experiments conducted on three large-scale benchmarks including S3DIS, Toronto3D, and SensatUrban demonstrate the superiority and the effectiveness of the proposed U-Next architecture.
arXiv Detail & Related papers (2023-04-03T06:59:08Z) - EGRC-Net: Embedding-induced Graph Refinement Clustering Network [66.44293190793294]
We propose a novel graph clustering network called Embedding-Induced Graph Refinement Clustering Network (EGRC-Net).
EGRC-Net effectively utilizes the learned embedding to adaptively refine the initial graph and enhance the clustering performance.
Our proposed methods consistently outperform several state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-19T09:08:43Z) - GraftNet: Towards Domain Generalized Stereo Matching with a
Broad-Spectrum and Task-Oriented Feature [2.610470075814367]
We propose to leverage the feature of a model trained on large-scale datasets to deal with the domain shift.
With the cosine similarity based cost volume as a bridge, the feature will be grafted to an ordinary cost aggregation module.
Experiments show that the model generalization ability can be improved significantly with this broad-spectrum and task-oriented feature.
arXiv Detail & Related papers (2022-04-01T03:10:04Z) - Multi-View Stereo Network with attention thin volume [0.0]
We propose an efficient multi-view stereo (MVS) network for inferring depth values from multiple RGB images.
We introduce the self-attention mechanism to fully aggregate the dominant information from input images.
We also introduce the group-wise correlation to feature aggregation, which greatly reduces the memory and calculation burden.
arXiv Detail & Related papers (2021-10-16T11:51:23Z) - FatNet: A Feature-attentive Network for 3D Point Cloud Processing [1.502579291513768]
We introduce a novel feature-attentive neural network layer, a FAT layer, that combines both global point-based features and local edge-based features in order to generate better embeddings.
Our architecture achieves state-of-the-art results on the task of point cloud classification, as demonstrated on the ModelNet40 dataset.
arXiv Detail & Related papers (2021-04-07T23:13:56Z) - PC-RGNN: Point Cloud Completion and Graph Neural Network for 3D Object
Detection [57.49788100647103]
LiDAR-based 3D object detection is an important task for autonomous driving.
Current approaches suffer from sparse and partial point clouds of distant and occluded objects.
In this paper, we propose a novel two-stage approach, namely PC-RGNN, dealing with such challenges by two specific solutions.
arXiv Detail & Related papers (2020-12-18T18:06:43Z) - Regularized Densely-connected Pyramid Network for Salient Instance
Segmentation [73.17802158095813]
We propose a new pipeline for end-to-end salient instance segmentation (SIS)
To better use the rich feature hierarchies in deep networks, we propose the regularized dense connections.
A novel multi-level RoIAlign based decoder is introduced to adaptively aggregate multi-level features for better mask predictions.
arXiv Detail & Related papers (2020-08-28T00:13:30Z) - Perceptron Synthesis Network: Rethinking the Action Scale Variances in
Videos [48.57686258913474]
Video action recognition has been partially addressed by the CNNs stacking of fixed-size 3D kernels.
We propose to learn the optimal-scale kernels from the data.
An textitaction perceptron synthesizer is proposed to generate the kernels from a bag of fixed-size kernels.
arXiv Detail & Related papers (2020-07-22T14:22:29Z) - Dense Hybrid Recurrent Multi-view Stereo Net with Dynamic Consistency
Checking [54.58791377183574]
Our novel hybrid recurrent multi-view stereo net consists of two core modules: 1) a light DRENet (Dense Reception Expanded) module to extract dense feature maps of original size with multi-scale context information, 2) a HU-LSTM (Hybrid U-LSTM) to regularize 3D matching volume into predicted depth map.
Our method exhibits competitive performance to the state-of-the-art method while dramatically reduces memory consumption, which costs only $19.4%$ of R-MVSNet memory consumption.
arXiv Detail & Related papers (2020-07-21T14:59:59Z) - ULSAM: Ultra-Lightweight Subspace Attention Module for Compact
Convolutional Neural Networks [4.143032261649983]
"Ultra-Lightweight Subspace Attention Mechanism" (ULSAM) is end-to-end trainable and can be deployed as a plug-and-play module in compact convolutional neural networks (CNNs)
We achieve $approx$13% and $approx$25% reduction in both the FLOPs and parameter counts of MobileNet-V2 with a 0.27% and more than 1% improvement in top-1 accuracy on the ImageNet-1K and fine-grained image classification datasets (respectively)
arXiv Detail & Related papers (2020-06-26T17:05:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.