AutoFocusFormer: Image Segmentation off the Grid
- URL: http://arxiv.org/abs/2304.12406v2
- Date: Wed, 25 Oct 2023 22:28:12 GMT
- Title: AutoFocusFormer: Image Segmentation off the Grid
- Authors: Chen Ziwen, Kaushik Patnaik, Shuangfei Zhai, Alvin Wan, Zhile Ren,
Alex Schwing, Alex Colburn, Li Fuxin
- Abstract summary: AutoFocusFormer (AFF) is a local-attention transformer image recognition backbone.
We develop a novel point-based local attention block, facilitated by a balanced clustering module.
Experiments show that our AutoFocusFormer (AFF) improves significantly over baseline models of similar sizes.
- Score: 11.257993284839621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real world images often have highly imbalanced content density. Some areas
are very uniform, e.g., large patches of blue sky, while other areas are
scattered with many small objects. Yet, the commonly used successive grid
downsampling strategy in convolutional deep networks treats all areas equally.
Hence, small objects are represented in very few spatial locations, leading to
worse results in tasks such as segmentation. Intuitively, retaining more pixels
representing small objects during downsampling helps to preserve important
information. To achieve this, we propose AutoFocusFormer (AFF), a
local-attention transformer image recognition backbone, which performs adaptive
downsampling by learning to retain the most important pixels for the task.
Since adaptive downsampling generates a set of pixels irregularly distributed
on the image plane, we abandon the classic grid structure. Instead, we develop
a novel point-based local attention block, facilitated by a balanced clustering
module and a learnable neighborhood merging module, which yields
representations for our point-based versions of state-of-the-art segmentation
heads. Experiments show that our AutoFocusFormer (AFF) improves significantly
over baseline models of similar sizes.
Related papers
- Exploring Multi-view Pixel Contrast for General and Robust Image Forgery Localization [4.8454936010479335]
We propose a Multi-view Pixel-wise Contrastive algorithm (MPC) for image forgery localization.
Specifically, we first pre-train the backbone network with the supervised contrastive loss.
Then the localization head is fine-tuned using the cross-entropy loss, resulting in a better pixel localizer.
arXiv Detail & Related papers (2024-06-19T13:51:52Z) - Pixel-Inconsistency Modeling for Image Manipulation Localization [59.968362815126326]
Digital image forensics plays a crucial role in image authentication and manipulation localization.
This paper presents a generalized and robust manipulation localization model through the analysis of pixel inconsistency artifacts.
Experiments show that our method successfully extracts inherent pixel-inconsistency forgery fingerprints.
arXiv Detail & Related papers (2023-09-30T02:54:51Z) - Hi-ResNet: Edge Detail Enhancement for High-Resolution Remote Sensing Segmentation [10.919956120261539]
High-resolution remote sensing (HRS) semantic segmentation extracts key objects from high-resolution coverage areas.
objects of the same category within HRS images show significant differences in scale and shape across diverse geographical environments.
We propose a High-resolution remote sensing network (Hi-ResNet) with efficient network structure designs.
arXiv Detail & Related papers (2023-05-22T03:58:25Z) - AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical one among which lies in foreground-background imbalance.
We propose Adaptive Focus Framework (AF$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$ has significantly improved the accuracy on three widely used aerial benchmarks, as fast as the mainstream method.
arXiv Detail & Related papers (2022-02-18T10:14:45Z) - Learning to Aggregate Multi-Scale Context for Instance Segmentation in
Remote Sensing Images [28.560068780733342]
A novel context aggregation network (CATNet) is proposed to improve the feature extraction process.
The proposed model exploits three lightweight plug-and-play modules, namely dense feature pyramid network (DenseFPN), spatial context pyramid ( SCP), and hierarchical region of interest extractor (HRoIE)
arXiv Detail & Related papers (2021-11-22T08:55:25Z) - Learning to Downsample for Segmentation of Ultra-High Resolution Images [6.432524678252553]
We show that learning the spatially varying downsampling strategy jointly with segmentation offers advantages in segmenting large images with limited computational budget.
Our method adapts the sampling density over different locations so that more samples are collected from the small important regions and less from the others.
We show on two public and one local high-resolution datasets that our method consistently learns sampling locations preserving more information and boosting segmentation accuracy over baseline methods.
arXiv Detail & Related papers (2021-09-22T23:04:59Z) - Pixel-Perfect Structure-from-Motion with Featuremetric Refinement [96.73365545609191]
We refine two key steps of structure-from-motion by a direct alignment of low-level image information from multiple views.
This significantly improves the accuracy of camera poses and scene geometry for a wide range of keypoint detectors.
Our system easily scales to large image collections, enabling pixel-perfect crowd-sourced localization at scale.
arXiv Detail & Related papers (2021-08-18T17:58:55Z) - AINet: Association Implantation for Superpixel Segmentation [82.21559299694555]
We propose a novel textbfAssociation textbfImplantation (AI) module to enable the network to explicitly capture the relations between the pixel and its surrounding grids.
Our method could not only achieve state-of-the-art performance but maintain satisfactory inference efficiency.
arXiv Detail & Related papers (2021-01-26T10:40:13Z) - An Empirical Method to Quantify the Peripheral Performance Degradation
in Deep Networks [18.808132632482103]
convolutional neural network (CNN) kernels compound with each convolutional layer.
Deeper and deeper networks combined with stride-based down-sampling means that the propagation of this region can end up covering a non-negligable portion of the image.
Our dataset is constructed by inserting objects into high resolution backgrounds, thereby allowing us to crop sub-images which place target objects at specific locations relative to the image border.
By probing the behaviour of Mask R-CNN across a selection of target locations, we see clear patterns of performance degredation near the image boundary, and in particular in the image corners.
arXiv Detail & Related papers (2020-12-04T18:00:47Z) - Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration task.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z) - Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, except for frequently-used VGG feature matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.