ISNet: Integrate Image-Level and Semantic-Level Context for Semantic
  Segmentation
- URL: http://arxiv.org/abs/2108.12382v1
- Date: Fri, 27 Aug 2021 16:38:22 GMT
- Title: ISNet: Integrate Image-Level and Semantic-Level Context for Semantic
  Segmentation
- Authors: Zhenchao Jin, Bin Liu, Qi Chu, Nenghai Yu
- Abstract summary: Co-occurring visual patterns make aggregating contextual information a common paradigm for enhancing pixel representations in semantic image segmentation.
Existing approaches focus on modeling the context from the perspective of the whole image, i.e., aggregating the image-level contextual information.
This paper proposes to augment the pixel representations by aggregating the image-level and semantic-level contextual information.
- Score: 64.56511597220837
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Co-occurring visual patterns make aggregating contextual information a common
paradigm for enhancing pixel representations in semantic image segmentation.
Existing approaches focus on modeling the context from the perspective of
the whole image, i.e., aggregating the image-level contextual information.
Although impressive, these methods weaken the significance of the pixel
representations of the same category, i.e., the semantic-level contextual
information. To address this, this paper proposes to augment the pixel
representations by aggregating both the image-level and the semantic-level
contextual information. First, an image-level context module is designed to
capture the contextual information for each pixel in the whole image. Second,
we aggregate the representations of the same category for each pixel, where the
category regions are learned under the supervision of the ground-truth
segmentation. Third, we compute the similarities between each pixel
representation and both the image-level and the semantic-level
contextual information. Finally, a pixel representation is
augmented by a weighted aggregation of the image-level contextual information
and the semantic-level contextual information, with the similarities as the
weights. Integrating the image-level and semantic-level context allows this
paper to report state-of-the-art accuracy on four benchmarks, i.e., ADE20K,
LIP, COCOStuff and Cityscapes.
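
The four steps in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `augment_pixels`, the use of dot-product attention for the image-level context, class-mean pooling over coarse predictions for the semantic-level context, the softmax over the two similarities, and the final concatenation are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def augment_pixels(feats, class_logits):
    """Hypothetical sketch of image-level + semantic-level context aggregation.

    feats:        (N, C) pixel representations (N = H * W flattened pixels).
    class_logits: (N, K) coarse per-pixel class scores, assumed to come from
                  a head supervised by the ground-truth segmentation.
    Returns the augmented representations (N, 2C) and the weights (N, 2).
    """
    n, c = feats.shape
    scale = np.sqrt(c)

    # Step 1 (assumption): image-level context for each pixel, computed as
    # dot-product attention over all pixels in the image.
    attn = softmax(feats @ feats.T / scale, axis=1)        # (N, N)
    image_ctx = attn @ feats                               # (N, C)

    # Step 2 (assumption): semantic-level context for each pixel, computed
    # as the mean feature of all pixels predicted to share its category.
    labels = class_logits.argmax(axis=1)                   # (N,)
    one_hot = np.eye(class_logits.shape[1])[labels]        # (N, K)
    counts = np.maximum(one_hot.sum(axis=0), 1.0)          # (K,)
    class_means = (one_hot.T @ feats) / counts[:, None]    # (K, C)
    semantic_ctx = class_means[labels]                     # (N, C)

    # Step 3: similarity between each pixel and each of its two contexts.
    sims = np.stack(
        [(feats * image_ctx).sum(axis=1), (feats * semantic_ctx).sum(axis=1)],
        axis=1,
    ) / scale                                              # (N, 2)
    weights = softmax(sims, axis=1)                        # rows sum to 1

    # Step 4: weighted aggregation of the two contexts, then augmentation
    # of the original representation (concatenation is an assumption).
    context = weights[:, :1] * image_ctx + weights[:, 1:] * semantic_ctx
    return np.concatenate([feats, context], axis=1), weights


# Toy usage: 6 pixels, 4 channels, 3 categories.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
logits = rng.normal(size=(6, 3))
augmented, weights = augment_pixels(feats, logits)
print(augmented.shape, weights.shape)
```

In this sketch the per-pixel weights decide how much each context contributes, so a pixel that resembles its category mean more than its attention-pooled neighborhood leans on the semantic-level context, and vice versa.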
 
      
Related papers
- DFEN: Dual Feature Equalization Network for Medical Image Segmentation [9.091452460153672]
 We propose a dual feature equalization network based on the hybrid architecture of Swin Transformer and Convolutional Neural Network.
Swin Transformer is utilized as both the encoder and decoder, thereby bolstering the ability of the model to capture long-range dependencies and spatial correlations.
 arXiv  Detail & Related papers  (2025-05-09T09:38:43Z)
- Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning [77.2852342808769]
 In this paper, we introduce a detailed caption benchmark, termed as CompreCap, to evaluate the visual context from a directed scene graph view.
We first manually segment the image into semantically meaningful regions according to common-object vocabulary, while also distinguishing attributes of objects within all those regions.
Then directional relation labels of these objects are annotated to compose a directed scene graph that can well encode rich compositional information of the image.
 arXiv  Detail & Related papers  (2024-12-11T18:37:42Z)
- Hierarchical Open-vocabulary Universal Image Segmentation [48.008887320870244]
 Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions.
We propose a decoupled text-image fusion mechanism and representation learning modules for both "things" and "stuff".
Our resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework.
 arXiv  Detail & Related papers  (2023-07-03T06:02:15Z)
- MCIBI++: Soft Mining Contextual Information Beyond Image for Semantic
  Segmentation [29.458735435545048]
 We propose a novel soft mining contextual information beyond image paradigm named MCIBI++.
We generate a class probability distribution for each pixel representation and conduct the dataset-level context aggregation.
In the inference phase, we additionally design a coarse-to-fine iterative inference strategy to further boost the segmentation results.
 arXiv  Detail & Related papers  (2022-09-09T18:03:52Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
 We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
 arXiv  Detail & Related papers  (2021-11-30T07:29:08Z)
- Mining Contextual Information Beyond Image for Semantic Segmentation [37.783233906684444]
 The paper studies the context aggregation problem in semantic image segmentation.
It proposes to mine the contextual information beyond individual images to further augment the pixel representations.
The proposed method could be effortlessly incorporated into existing segmentation frameworks.
 arXiv  Detail & Related papers  (2021-08-26T14:34:23Z)
- Exploring Cross-Image Pixel Contrast for Semantic Segmentation [130.22216825377618]
 We propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting.
The core idea is to enforce pixel embeddings belonging to a same semantic class to be more similar than embeddings from different classes.
Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing.
 arXiv  Detail & Related papers  (2021-01-28T11:35:32Z)
- VICTR: Visual Information Captured Text Representation for Text-to-Image
  Multimodal Tasks [5.840117063192334]
 We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input.
We train the extracted objects, attributes, and relations in the scene graph and the corresponding geometric relation information using Graph Convolutional Networks.
The text representation is aggregated with word-level and sentence-level embedding to generate both visual contextual word and sentence representation.
 arXiv  Detail & Related papers  (2020-10-07T05:25:30Z)
- Cross-domain Correspondence Learning for Exemplar-based Image
  Translation [59.35767271091425]
 We present a framework for exemplar-based image translation, which synthesizes a photo-realistic image from the input in a distinct domain.
The output has the style (e.g., color, texture) in consistency with the semantically corresponding objects in the exemplar.
We show that our method is significantly superior to state-of-the-art methods in terms of image quality.
 arXiv  Detail & Related papers  (2020-04-12T09:10:57Z)
- Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis [194.1452124186117]
 We propose a novel ECGAN for the challenging semantic image synthesis task.
Our ECGAN achieves significantly better results than state-of-the-art methods.
 arXiv  Detail & Related papers  (2020-03-31T01:23:21Z)