Situational Perception Guided Image Matting
- URL: http://arxiv.org/abs/2204.09276v3
- Date: Fri, 22 Apr 2022 11:01:16 GMT
- Title: Situational Perception Guided Image Matting
- Authors: Bo Xu and Jiake Xie and Han Huang and Ziwen Li and Cheng Lu and Yong
Tang and Yandong Guo
- Abstract summary: We propose a Situational Perception Guided Image Matting (SPG-IM) method that mitigates the subjective bias of matting annotations.
SPG-IM can better associate inter-object and object-to-environment saliency, and compensate for the subjective nature of image matting.
- Score: 16.1897179939677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most automatic matting methods try to separate the salient foreground from
the background. However, the insufficient quantity and subjective bias of existing
matting datasets make it difficult to fully explore the object-to-object and
object-to-environment semantic associations in a given image. In this paper, we
propose a Situational Perception Guided Image Matting (SPG-IM) method that mitigates
the subjective bias of matting annotations and captures sufficient situational
perception information, distilled from a visual-to-textual task, for better global
saliency. SPG-IM can better associate inter-object and object-to-environment
saliency, and compensate for the subjective nature of image matting and its
expensive annotation. We also introduce a Textual Semantic Transformation (TST)
module that effectively transforms and integrates the semantic feature stream to
guide the visual representations. In addition, an Adaptive Focal Transformation
(AFT) Refinement Network is proposed to adaptively switch multi-scale receptive
fields and focal points to enhance both global and local details. Extensive
experiments demonstrate the effectiveness of situational perception guidance from
visual-to-textual tasks for image matting, and our model outperforms
state-of-the-art methods. We also analyze the significance of the different
components in our model. The code will be released soon.
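
The abstract describes the Textual Semantic Transformation (TST) module only at a high level: a semantic feature stream distilled from a visual-to-textual task is transformed and integrated to guide the visual representations. The following is a minimal sketch of one plausible realization of such text-guided feature modulation (FiLM-style channel-wise conditioning); the class name, dimensions, and layer layout are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class TextGuidedModulation(nn.Module):
    """Hypothetical sketch: modulate a visual feature map with a pooled
    textual embedding via per-channel scale and shift (FiLM-style)."""

    def __init__(self, text_dim: int = 768, visual_channels: int = 256):
        super().__init__()
        # Transform the textual embedding into per-channel conditioning parameters.
        self.to_scale = nn.Linear(text_dim, visual_channels)
        self.to_shift = nn.Linear(text_dim, visual_channels)

    def forward(self, visual_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W) feature map from the visual backbone
        # text_feat:   (B, text_dim) embedding distilled from a visual-to-textual model
        scale = self.to_scale(text_feat)[:, :, None, None]  # (B, C, 1, 1)
        shift = self.to_shift(text_feat)[:, :, None, None]  # (B, C, 1, 1)
        return visual_feat * (1.0 + scale) + shift

Used inside a matting decoder, a block like this would let the textual stream re-weight visual channels before saliency prediction; the actual TST design in the paper may differ substantially.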
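Similarly, the Adaptive Focal Transformation (AFT) Refinement Network is described only as adaptively switching multi-scale receptive fields and focal points. A minimal sketch of that idea, assuming parallel dilated-convolution branches mixed by input-dependent gates, is given below; the branch count, dilation rates, and gating scheme are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class AdaptiveReceptiveField(nn.Module):
    """Hypothetical sketch: combine parallel dilated-conv branches with
    gates predicted from globally pooled features."""

    def __init__(self, channels: int = 64, dilations=(1, 2, 4)):
        super().__init__()
        # One branch per dilation rate; padding keeps the spatial size fixed.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        # Predict one gate weight per branch from globally pooled features.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, len(dilations)),
            nn.Softmax(dim=-1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.gate(x)                   # (B, num_branches)
        out = torch.zeros_like(x)
        for i, branch in enumerate(self.branches):
            w = weights[:, i].view(-1, 1, 1, 1)  # broadcast over C, H, W
            out = out + w * branch(x)
        return out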
Related papers
- Learning with Multi-modal Gradient Attention for Explainable Composed Image Retrieval [15.24270990274781]
We propose a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step.
We show how MMGrad can be incorporated into an end-to-end model training strategy with a new learning objective that explicitly forces these MMGrad attention maps to highlight the correct local regions corresponding to the modifier text.
arXiv Detail & Related papers (2023-08-31T11:46:27Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- SemAug: Semantically Meaningful Image Augmentations for Object Detection Through Language Grounding [5.715548995729382]
We propose an effective technique for image augmentation by injecting contextually meaningful knowledge into the scenes.
Our method of semantically meaningful image augmentation for object detection via language grounding, SemAug, starts by calculating semantically appropriate new objects.
arXiv Detail & Related papers (2022-08-15T19:00:56Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task by a novel Bottom-up crOss-modal Semantic compoSition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Marginal Contrastive Correspondence for Guided Image Generation [58.0605433671196]
Exemplar-based image translation establishes dense correspondences between a conditional input and an exemplar from two different domains.
Existing work builds the cross-domain correspondences implicitly by minimizing feature-wise distances across the two domains.
We design a Marginal Contrastive Learning Network (MCL-Net) that explores contrastive learning to learn domain-invariant features for realistic exemplar-based image translation.
arXiv Detail & Related papers (2022-04-01T13:55:44Z)
- Towards Full-to-Empty Room Generation with Structure-Aware Feature Encoding and Soft Semantic Region-Adaptive Normalization [67.64622529651677]
We propose a simple yet effective adjusted fully differentiable soft semantic region-adaptive normalization module (softSEAN) block.
Besides mitigating training complexity and non-differentiability issues, our approach surpasses the compared methods both quantitatively and qualitatively.
Our softSEAN block can be used as a drop-in module for existing discriminative and generative models.
arXiv Detail & Related papers (2021-12-10T09:00:13Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.