Polarized Self-Attention: Towards High-quality Pixel-wise Regression
- URL: http://arxiv.org/abs/2107.00782v1
- Date: Fri, 2 Jul 2021 01:03:11 GMT
- Title: Polarized Self-Attention: Towards High-quality Pixel-wise Regression
- Authors: Huajun Liu, Fuqiang Liu, Xinyi Fan, Dong Huang
- Abstract summary: This paper presents the Polarized Self-Attention (PSA) block that incorporates two critical designs towards high-quality pixel-wise regression.
Experimental results show that PSA boosts standard baselines by $2-4$ points, and boosts state-of-the-art methods by $1-2$ points on 2D pose estimation and semantic segmentation benchmarks.
- Score: 19.2303932008785
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pixel-wise regression is probably the most common problem in fine-grained
computer vision tasks, such as estimating keypoint heatmaps and segmentation
masks. These regression problems are very challenging particularly because they
require, at low computation overheads, modeling long-range dependencies on
high-resolution inputs/outputs to estimate the highly nonlinear pixel-wise
semantics. While attention mechanisms in Deep Convolutional Neural
Networks (DCNNs) have become popular for boosting long-range dependencies,
element-specific attention, such as Nonlocal blocks, is highly complex and
noise-sensitive to learn, and most simplified attention hybrids try to reach
the best compromise among multiple types of tasks. In this paper, we present
the Polarized Self-Attention (PSA) block that incorporates two critical designs
towards high-quality pixel-wise regression: (1) Polarized filtering: keeping
high internal resolution in both channel and spatial attention computation
while completely collapsing input tensors along their counterpart dimensions.
(2) Enhancement: composing non-linearity that directly fits the output
distribution of typical fine-grained regression, such as the 2D Gaussian
distribution (keypoint heatmaps), or the 2D Binomial distribution (binary
segmentation masks). PSA appears to have exhausted the representation capacity
within its channel-only and spatial-only branches, such that there are only
marginal metric differences between its sequential and parallel layouts.
Experimental results show that PSA boosts standard baselines by $2-4$ points,
and boosts state-of-the-art methods by $1-2$ points on 2D pose estimation and semantic
segmentation benchmarks.
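The two branches described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The projection shapes (a single-channel query and a C/2-channel value in the channel-only branch, C/2-channel query and value in the spatial-only branch), the softmax-then-sigmoid composition, and all function names are assumptions made for illustration, and any normalization layer is omitted. The feature map is a C x N nested list, where N is the flattened spatial size H*W.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def project(weights, x):
    """1x1 convolution as a matrix product: weights is K x C, x is C x N."""
    n = len(x[0])
    return [[sum(w * x[c][p] for c, w in enumerate(row)) for p in range(n)]
            for row in weights]


def channel_only_attention(x, w_q, w_v, w_z):
    """Collapse the spatial dimension completely; return one weight per channel.

    Assumed shapes: w_q has length C, w_v is (C/2) x C, w_z is C x (C/2).
    """
    q = softmax(project([w_q], x)[0])  # 1 x N, softmax over positions
    v = project(w_v, x)                # (C/2) x N
    # Weighted sum over all positions removes the spatial dimension.
    z = [sum(vk[p] * q[p] for p in range(len(q))) for vk in v]
    # Project back to C channels and squash into (0, 1).
    return [sigmoid(sum(w * zk for w, zk in zip(row, z))) for row in w_z]


def spatial_only_attention(x, w_q, w_v):
    """Collapse the channel dimension completely; return one weight per position.

    Assumed shapes: w_q and w_v are both (C/2) x C.
    """
    n = len(x[0])
    pooled = [sum(row) / n for row in project(w_q, x)]  # global average pool
    q = softmax(pooled)                                 # softmax over channels
    v = project(w_v, x)                                 # (C/2) x N
    return [sigmoid(sum(q[k] * v[k][p] for k in range(len(q))))
            for p in range(n)]


def psa_parallel(x, ch_weights, sp_weights):
    """Parallel layout: sum of the channel-reweighted and spatially reweighted input."""
    return [[x[c][p] * ch_weights[c] + x[c][p] * sp_weights[p]
             for p in range(len(x[c]))]
            for c in range(len(x))]


# Toy input with C = 2 channels and N = 3 positions; weights are arbitrary.
x = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
ch = channel_only_attention(x, w_q=[1.0, 0.0], w_v=[[1.0, 1.0]], w_z=[[1.0], [-1.0]])
sp = spatial_only_attention(x, w_q=[[1.0, 0.0]], w_v=[[0.5, 0.5]])
out = psa_parallel(x, ch, sp)
```

In this sketch each branch keeps full resolution along the dimension it attends over (all C channels, or all N positions) while the other dimension is reduced to a single vector before the final sigmoid.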
Related papers
- Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation [15.30336007288786]
PVT-GDLA is a decoder-centric Transformer that restores sharp, long-range dependencies at linear time.
It achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets.
arXiv Detail & Related papers (2026-03-03T08:26:08Z) - LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis.
We study how to design compute-efficient linear attention within this framework.
We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z) - Cross-Layer Attentive Feature Upsampling for Low-latency Semantic Segmentation [52.01210390327581]
We propose Guided Attentive Interpolation (GAI) to adaptively interpolate fine-grained high-resolution features with semantic features.
GAI determines both spatial and semantic relations of pixels from features of different resolutions and then leverages these relations to interpolate high-resolution features with rich semantics.
In experiments, the GAI-based semantic segmentation networks, i.e., GAIN, can achieve 78.8 mIoU with 22.3 FPS on Cityscapes and 80.6 mIoU with 64.5 FPS on CamVid using an NVIDIA 1080Ti GPU.
arXiv Detail & Related papers (2026-01-03T12:09:49Z) - HyperTopo-Adapters: Geometry- and Topology-Aware Segmentation of Leaf Lesions on Frozen Encoders [0.14323566945483493]
Leaf-lesion segmentation is topology-sensitive; small merges, splits, or false holes can be meaningful descriptors of biochemical pathways.
I explore HyperTopo-Adapters, a lightweight, parameter-efficient head trained on top of a frozen vision encoder.
Early results show consistent gains in boundary and topology metrics on a Kaggle leaf-lesion dataset.
arXiv Detail & Related papers (2025-12-29T04:27:26Z) - GSPN-2: Efficient Parallel Sequence Modeling [101.33780567131716]
Generalized Spatial Propagation Network (GSPN) addresses this by replacing quadratic self-attention with a line-scan propagation scheme.
GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications.
arXiv Detail & Related papers (2025-11-28T07:26:45Z) - Saccadic Vision for Fine-Grained Visual Classification [10.681604440788854]
Fine-grained visual classification (FGVC) requires distinguishing between visually similar categories through subtle, localized features.
Existing part-based methods rely on complex localization networks that learn mappings from pixel to sample space.
We propose a two-stage process that first extracts peripheral features and generates a sample map.
We employ contextualized selective attention to weigh the impact of each fixation patch before fusing peripheral and focus representations.
arXiv Detail & Related papers (2025-09-19T07:03:37Z) - Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective [47.87649021414188]
We present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity.
Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency.
arXiv Detail & Related papers (2025-07-02T12:27:06Z) - Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation [57.56385490252605]
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention.
We propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste.
arXiv Detail & Related papers (2025-05-24T21:30:29Z) - UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler [62.06785782635153]
We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains.
UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information.
Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations.
arXiv Detail & Related papers (2025-02-27T14:03:15Z) - Project-and-Fuse: Improving RGB-D Semantic Segmentation via Graph Convolution Networks [21.713293775719414]
We propose to fuse features from two modalities in a late fusion style, during which the geometric feature injection is guided by texture feature prior.
At the 3D feature extraction stage, we argue that traditional CNNs are not efficient enough for depth maps.
At the projection matrix generation stage, we find the existence of Biased-Assignment and Ambiguous-Locality issues in the original pipeline.
arXiv Detail & Related papers (2025-01-31T02:24:13Z) - A Simple and Generalist Approach for Panoptic Segmentation [57.94892855772925]
Generalist vision models aim for one and the same architecture for a variety of vision tasks.
While such a shared architecture may seem attractive, generalist models tend to be outperformed by their bespoke counterparts.
We address this problem by introducing two key contributions, without compromising the desirable properties of generalist models.
arXiv Detail & Related papers (2024-08-29T13:02:12Z) - Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement [19.575833741231953]
We use the KNN method to determine the neighborhoods of raw surface points.
A conditional probability model is adaptive to local geometry, leading to significant rate reduction.
We incorporate an implicit neural representation into the refinement layer, allowing the decoder to sample points on the underlying surface at arbitrary densities.
arXiv Detail & Related papers (2024-08-06T05:24:06Z) - Double-Shot 3D Shape Measurement with a Dual-Branch Network [14.749887303860717]
We propose a dual-branch Convolutional Neural Network (CNN)-Transformer network (PDCNet) to process different structured light (SL) modalities.
Within PDCNet, a Transformer branch is used to capture global perception in the fringe images, while a CNN branch is designed to collect local details in the speckle images.
We show that our method can reduce fringe order ambiguity while producing high-accuracy results on a self-made dataset.
arXiv Detail & Related papers (2024-07-19T10:49:26Z) - Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention based graph neural networks (GNNs) with fully-connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
arXiv Detail & Related papers (2023-07-04T02:50:44Z) - High-fidelity Pseudo-labels for Boosting Weakly-Supervised Segmentation [17.804090651425955]
Image-level weakly-supervised segmentation (WSSS) reduces the usually vast data annotation cost by surrogate segmentation masks during training.
Our work is based on two techniques for improving CAMs: importance sampling, which is a substitute for GAP, and the feature similarity loss.
We reformulate both techniques based on binomial posteriors of multiple independent binary problems.
This has two benefits: their performance is improved and they become more general, resulting in an add-on method that can boost virtually any WSSS method.
arXiv Detail & Related papers (2023-04-05T17:43:57Z) - Lesion-aware Dynamic Kernel for Polyp Segmentation [49.63274623103663]
We propose a lesion-aware dynamic network (LDNet) for polyp segmentation.
It is a traditional u-shape encoder-decoder structure incorporated with a dynamic kernel generation and updating scheme.
This simple but effective scheme endows our model with powerful segmentation performance and generalization capability.
arXiv Detail & Related papers (2023-01-12T09:53:57Z) - Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing [71.19528222206088]
We propose a novel Decoupled Multi-task Learning with Cyclical Self-Regulation for face parsing.
Specifically, DML-CSR designs a multi-task model which comprises face parsing, binary edge, and category edge detection.
Our method achieves the new state-of-the-art performance on the Helen, CelebA-HQ, and LapaMask datasets.
arXiv Detail & Related papers (2022-03-28T02:12:30Z) - Sparse Cross-scale Attention Network for Efficient LiDAR Panoptic Segmentation [12.61753274984776]
We present SCAN, a novel sparse cross-scale attention network to align multi-scale sparse features with global voxel-encoded attention to capture the long-range relationship of instance context.
For the surface-aggregated points, SCAN adopts a novel sparse class-agnostic representation of instance centroids, which can not only maintain the sparsity of aligned features, but also reduce the amount of the network through sparse convolution.
arXiv Detail & Related papers (2022-01-16T05:34:54Z) - Augmenting Convolutional networks with attention-based aggregation [55.97184767391253]
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning.
We plug this learned aggregation layer with a simplistic patch-based convolutional network parametrized by 2 parameters (width and depth).
It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption.
arXiv Detail & Related papers (2021-12-27T14:05:41Z) - Pixel-in-Pixel Net: Towards Efficient Facial Landmark Detection in the Wild [104.61677518999976]
We propose Pixel-in-Pixel Net (PIPNet) for facial landmark detection.
The proposed model is equipped with a novel detection head based on heatmap regression.
To further improve the cross-domain generalization capability of PIPNet, we propose self-training with curriculum.
arXiv Detail & Related papers (2020-03-08T12:23:42Z)