ORDNet: Capturing Omni-Range Dependencies for Scene Parsing
- URL: http://arxiv.org/abs/2101.03929v1
- Date: Mon, 11 Jan 2021 14:51:11 GMT
- Title: ORDNet: Capturing Omni-Range Dependencies for Scene Parsing
- Authors: Shaofei Huang, Si Liu, Tianrui Hui, Jizhong Han, Bo Li, Jiashi Feng
and Shuicheng Yan
- Abstract summary: We build an Omni-Range Dependencies Network (ORDNet) which can effectively capture short-, middle- and long-range dependencies.
Our ORDNet is able to extract more comprehensive context information and well adapt to complex spatial variance in scene images.
- Score: 135.11360962062957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to capture dependencies between spatial positions is essential to
many visual tasks, especially the dense labeling problems like scene parsing.
Existing methods can effectively capture long-range dependencies with the
self-attention mechanism, while short-range ones are captured by local
convolution. However, a large gap remains between long-range and short-range
dependencies, which greatly reduces models' flexibility in adapting to the
diverse spatial scales and relationships in complicated natural scene images. To fill such a
gap, we develop a Middle-Range (MR) branch to capture middle-range dependencies
by restricting self-attention to local patches. We also observe that spatial
regions having large correlations with others can be emphasized to exploit
long-range dependencies more accurately, and thus propose a Reweighed
Long-Range (RLR) branch. Based on the proposed MR and RLR branches, we build an
Omni-Range Dependencies Network (ORDNet) which can effectively capture short-,
middle- and long-range dependencies. Our ORDNet extracts more
comprehensive context information and adapts well to the complex spatial variance in
scene images. Extensive experiments show that our proposed ORDNet outperforms
previous state-of-the-art methods on three scene parsing benchmarks, including
PASCAL Context, COCO Stuff and ADE20K, demonstrating the superiority of
capturing omni-range dependencies in deep models for the scene parsing task.
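The two branches described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the non-overlapping patch layout, and the exact reweighting scheme (rescaling each position by how strongly it is attended to globally) are assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def middle_range_attention(feat, patch_size):
    """Self-attention restricted to non-overlapping local patches of a
    (H, W, C) feature map, so each position attends only within its patch."""
    h, w, c = feat.shape
    out = np.empty_like(feat)
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patch = feat[i:i + patch_size, j:j + patch_size]
            ph, pw, _ = patch.shape
            tokens = patch.reshape(-1, c)                 # (ph*pw, C)
            attn = softmax(tokens @ tokens.T / np.sqrt(c))
            out[i:i + patch_size, j:j + patch_size] = (attn @ tokens).reshape(ph, pw, c)
    return out

def reweighed_long_range(feat):
    """Global self-attention whose output is rescaled per position by how
    strongly that position is attended to by all others -- a rough proxy
    for emphasizing regions with large correlations to the rest."""
    h, w, c = feat.shape
    tokens = feat.reshape(-1, c)                          # (H*W, C)
    attn = softmax(tokens @ tokens.T / np.sqrt(c))        # (N, N)
    weight = attn.sum(axis=0, keepdims=True).T            # attention received per token
    weight = weight / weight.mean()                       # normalize around 1
    return (weight * (attn @ tokens)).reshape(h, w, c)

feat = np.random.default_rng(0).standard_normal((8, 8, 16)).astype(np.float32)
mr = middle_range_attention(feat, patch_size=4)
rlr = reweighed_long_range(feat)
assert mr.shape == feat.shape and rlr.shape == feat.shape
```

With `patch_size=4` on an 8x8 map, each position only mixes information within its 4x4 patch (a middle-range scope between a 3x3 convolution and full-image attention), while the reweighed branch still attends globally but amplifies highly-correlated regions.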
Related papers
- CollaMamba: Efficient Collaborative Perception with Cross-Agent Spatial-Temporal State Space Model [12.461378793357705]
Multi-agent collaborative perception fosters a deeper understanding of the environment.
Recent studies on collaborative perception mostly utilize CNNs or Transformers to learn feature representation and fusion in the spatial dimension.
We propose a resource efficient cross-agent spatial-temporal collaborative state space model (SSM), named CollaMamba.
arXiv Detail & Related papers (2024-09-12T02:50:04Z) - DQnet: Cross-Model Detail Querying for Camouflaged Object Detection [54.82390534024954]
A convolutional neural network (CNN) for camouflaged object detection tends to activate local discriminative regions while ignoring complete object extent.
In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNN.
In order to obtain feature maps that could activate full object extent, a novel framework termed Cross-Model Detail Querying network (DQnet) is proposed.
arXiv Detail & Related papers (2022-12-16T06:23:58Z) - Lightweight Long-Range Generative Adversarial Networks [58.16484259508973]
We introduce a novel lightweight generative adversarial network that can effectively capture long-range dependencies in the image generation process.
The proposed long-range module can highlight negative relations between pixels, working as a regularization to stabilize training.
Our novel long-range module introduces only a few additional parameters and is easily inserted into existing models to capture long-range dependencies.
arXiv Detail & Related papers (2022-09-08T13:05:01Z) - Learning to Aggregate Multi-Scale Context for Instance Segmentation in
Remote Sensing Images [28.560068780733342]
A novel context aggregation network (CATNet) is proposed to improve the feature extraction process.
The proposed model exploits three lightweight plug-and-play modules, namely dense feature pyramid network (DenseFPN), spatial context pyramid (SCP), and hierarchical region of interest extractor (HRoIE).
arXiv Detail & Related papers (2021-11-22T08:55:25Z) - LSTA-Net: Long short-term Spatio-Temporal Aggregation Network for
Skeleton-based Action Recognition [14.078419675904446]
LSTA-Net: a novel Long Short-Term Spatio-Temporal Aggregation Network.
Long/short-term temporal information is not well explored in existing works.
Experiments were conducted on three public benchmark datasets.
arXiv Detail & Related papers (2021-11-01T10:53:35Z) - An attention-driven hierarchical multi-scale representation for visual
recognition [3.3302293148249125]
Convolutional Neural Networks (CNNs) have revolutionized the understanding of visual content.
We propose a method to capture high-level long-range dependencies by exploring Graph Convolutional Networks (GCNs).
Our approach is simple yet extremely effective in solving both the fine-grained and generic visual classification problems.
arXiv Detail & Related papers (2021-10-23T09:22:22Z) - Global Aggregation then Local Distribution for Scene Parsing [99.1095068574454]
We show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing semantic segmentation networks.
Our approach allows us to set a new state of the art on major semantic segmentation benchmarks including Cityscapes, ADE20K, Pascal Context, Camvid and COCO-stuff.
arXiv Detail & Related papers (2021-07-28T03:46:57Z) - Reconstructive Sequence-Graph Network for Video Summarization [107.0328985865372]
Exploiting the inner-shot and inter-shot dependencies is essential for key-shot based video summarization.
We propose a Reconstructive Sequence-Graph Network (RSGN) to encode the frames and shots as sequence and graph hierarchically.
A reconstructor is developed to reward the summary generator, so that the generator can be optimized in an unsupervised manner.
arXiv Detail & Related papers (2021-05-10T01:47:55Z) - Multi-Attention-Network for Semantic Segmentation of Fine Resolution
Remote Sensing Images [10.835342317692884]
The accuracy of semantic segmentation in remote sensing images has been increased significantly by deep convolutional neural networks.
This paper proposes a Multi-Attention-Network (MANet) to address these issues.
A novel attention mechanism of kernel attention with linear complexity is proposed to alleviate the large computational demand in attention.
arXiv Detail & Related papers (2020-09-03T09:08:02Z) - Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features together with spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.