Virtual Multi-Modality Self-Supervised Foreground Matting for
Human-Object Interaction
- URL: http://arxiv.org/abs/2110.03278v1
- Date: Thu, 7 Oct 2021 09:03:01 GMT
- Title: Virtual Multi-Modality Self-Supervised Foreground Matting for
Human-Object Interaction
- Authors: Bo Xu, Han Huang, Cheng Lu, Ziwen Li and Yandong Guo
- Abstract summary: We propose a Virtual Multi-modality Foreground Matting (VMFM) method to learn human-object interactive foreground.
The VMFM method requires no additional inputs, e.g., a trimap or a known background.
We reformulate foreground matting as a self-supervised multi-modality problem.
- Score: 18.14237514372724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing human matting algorithms try to separate a pure human-only
foreground from the background. In this paper, we propose a Virtual
Multi-modality Foreground Matting (VMFM) method to learn the human-object
interactive foreground (the human together with the objects he or she interacts
with) from a raw RGB image. The VMFM method requires no additional inputs,
e.g., a trimap or a known background. We reformulate foreground matting as a
self-supervised multi-modality problem: factor each input image into an
estimated depth map, a segmentation mask, and an interaction heatmap using
three auto-encoders. To fully utilize the characteristics of each modality, we
first train a dual encoder-to-decoder network to estimate the same alpha matte.
We then introduce a self-supervised method, Complementary Learning (CL), to
predict a deviation probability map and exchange reliable gradients across
modalities without labels. We conducted extensive experiments to analyze the
effectiveness of each modality and the significance of the different components
in complementary learning. We demonstrate that our model outperforms
state-of-the-art methods.
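To make the pipeline above easier to follow, here is a minimal, illustrative PyTorch sketch of the described components; the module names, channel sizes, and the per-pixel reliability rule are assumptions for illustration, not the authors' released implementation. Three auto-encoders factor the RGB image into pseudo modalities, two encoder-to-decoder branches regress the same alpha matte, and a simplified complementary-learning step uses a deviation (disagreement) map to let the more reliable branch supervise the other without ground-truth labels.

```python
# Minimal, illustrative PyTorch sketch of the VMFM idea described above.
# Module names, channel sizes and the reliability rule are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class AutoEncoder(nn.Module):
    """Factor an RGB image into one pseudo modality (depth / mask / heatmap)."""
    def __init__(self, out_channels):
        super().__init__()
        self.enc = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.dec = nn.Sequential(conv_block(64, 32), nn.Conv2d(32, out_channels, 1))
    def forward(self, x):
        return torch.sigmoid(self.dec(self.enc(x)))

class MattingBranch(nn.Module):
    """One encoder-to-decoder branch mapping RGB + modalities to an alpha matte."""
    def __init__(self, modality_channels):
        super().__init__()
        self.net = nn.Sequential(conv_block(3 + modality_channels, 64),
                                 conv_block(64, 32), nn.Conv2d(32, 1, 1))
    def forward(self, rgb, modality):
        return torch.sigmoid(self.net(torch.cat([rgb, modality], dim=1)))

# Three auto-encoders factor the image into depth, segmentation and interaction heatmap.
depth_ae, seg_ae, heat_ae = AutoEncoder(1), AutoEncoder(1), AutoEncoder(1)
# Two branches predict the same alpha matte from different modality combinations.
branch_a = MattingBranch(modality_channels=2)   # e.g. depth + segmentation
branch_b = MattingBranch(modality_channels=2)   # e.g. segmentation + heatmap

rgb = torch.rand(1, 3, 256, 256)
depth, seg, heat = depth_ae(rgb), seg_ae(rgb), heat_ae(rgb)
alpha_a = branch_a(rgb, torch.cat([depth, seg], dim=1))
alpha_b = branch_b(rgb, torch.cat([seg, heat], dim=1))

# Complementary learning (simplified): a deviation map says, per pixel, where the
# two branches disagree; the more reliable branch supervises the other one, so no
# ground-truth alpha label is needed.
deviation = (alpha_a - alpha_b).abs().detach()      # proxy for disagreement
trust_a = (deviation < deviation.mean()).float()    # crude reliability mask
loss_b_from_a = (trust_a * (alpha_b - alpha_a.detach()).abs()).mean()
```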
Related papers
- End-to-end Semantic-centric Video-based Multimodal Affective Computing [27.13963885724786]
We propose a novel end-to-end framework named SemanticMAC to compute multimodal semantic-centric affection for human-spoken videos.
We employ a pre-trained Transformer model in multimodal data pre-processing and design an Affective Perceiver module to capture unimodal affective information.
SemanticMAC effectively learns specific and shared semantic representations under the guidance of semantic-centric labels.
arXiv Detail & Related papers (2024-08-14T17:50:27Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
The experimental results demonstrate that MPI exhibits remarkable improvements of 10% to 64% over the previous state of the art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the greedy need of Vision Transformer networks for very large annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results in low-shot settings and strong experimental results under various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Motor Imagery Decoding Using Ensemble Curriculum Learning and Collaborative Training [11.157243900163376]
Multi-subject EEG datasets present several kinds of domain shifts.
These domain shifts impede robust cross-subject generalization.
We propose a two-stage model ensemble architecture built with multiple feature extractors.
We demonstrate that our model ensembling approach combines the powers of curriculum learning and collaborative training.
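As a rough, generic illustration of a multi-extractor ensemble with collaborative training (the architecture, channel counts, and loss weighting below are assumptions, not the paper's exact design), each member is fit to the labels while also being pulled toward the averaged ensemble prediction:

```python
# Generic illustration of an ensemble of feature extractors with collaborative
# training on EEG windows; all details here are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EEGFeatureExtractor(nn.Module):
    def __init__(self, channels=22, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(channels, hidden, 7, padding=3), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten())
    def forward(self, x):          # x: (B, channels, time)
        return self.net(x)         # (B, hidden)

class EnsembleClassifier(nn.Module):
    def __init__(self, n_members=3, n_classes=4):
        super().__init__()
        self.members = nn.ModuleList([EEGFeatureExtractor() for _ in range(n_members)])
        self.heads = nn.ModuleList([nn.Linear(64, n_classes) for _ in range(n_members)])
    def forward(self, x):
        return [head(m(x)) for m, head in zip(self.members, self.heads)]

model = EnsembleClassifier()
x, y = torch.rand(8, 22, 256), torch.randint(0, 4, (8,))
logits = model(x)
ensemble = torch.stack(logits).mean(0)

# Collaborative training (simplified): each member fits the labels and is also
# pulled toward the averaged ensemble prediction.
ce = sum(F.cross_entropy(l, y) for l in logits)
kl = sum(F.kl_div(F.log_softmax(l, dim=1), F.softmax(ensemble.detach(), dim=1),
                  reduction="batchmean") for l in logits)
loss = ce + 0.1 * kl
```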
arXiv Detail & Related papers (2022-11-21T13:45:44Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
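As a hypothetical sketch of the masked-token-prediction idea (not the M3AE code), image patches and text tokens can be embedded into a single sequence, partially masked, and encoded by one shared Transformer that reconstructs the masked content:

```python
# Hypothetical sketch of masked token prediction over a joint image+text sequence;
# this is not the M3AE implementation, just an illustration of the idea in PyTorch.
import torch
import torch.nn as nn

class JointMaskedEncoder(nn.Module):
    def __init__(self, vocab_size=1000, patch_dim=768, d_model=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.patch_head = nn.Linear(d_model, patch_dim)   # reconstruct masked patches
        self.text_head = nn.Linear(d_model, vocab_size)   # predict masked words

    def forward(self, patches, text_ids, mask):
        # patches: (B, P, patch_dim), text_ids: (B, T), mask: (B, P+T) bool, True = masked
        tokens = torch.cat([self.patch_proj(patches), self.text_embed(text_ids)], dim=1)
        tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)   # crude token masking
        h = self.encoder(tokens)
        P = patches.size(1)
        return self.patch_head(h[:, :P]), self.text_head(h[:, P:])

model = JointMaskedEncoder()
patches, text = torch.rand(2, 16, 768), torch.randint(0, 1000, (2, 8))
mask = torch.rand(2, 24) < 0.5
patch_pred, text_pred = model(patches, text, mask)
```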
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data [13.68491474904529]
We propose Text-enhanced Visual Deep InfoMax (TVDIM) to learn better visual representations.
Our core idea of self-supervised learning is to maximize the mutual information between features extracted from multiple views.
TVDIM significantly outperforms previous visual self-supervised methods when processing the same set of images.
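In practice this kind of mutual-information objective is often approximated with an InfoNCE-style contrastive loss between the two views; the sketch below is a generic illustration of that bound, not TVDIM's actual loss:

```python
# Generic InfoNCE-style lower bound on mutual information between two views;
# an illustration of the stated objective, not TVDIM's implementation.
import torch
import torch.nn.functional as F

def infonce_loss(view_a, view_b, temperature=0.1):
    """view_a, view_b: (B, D) features of the same B samples under two views."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matching pairs (diagonal) are positives; all other pairs are negatives.
    return F.cross_entropy(logits, targets)

# Example: image features vs. features from the paired (noisy) text.
img_feat, txt_feat = torch.rand(8, 128), torch.rand(8, 128)
loss = infonce_loss(img_feat, txt_feat)
```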
arXiv Detail & Related papers (2021-06-03T12:36:01Z)
- Combining Semantic Guidance and Deep Reinforcement Learning For Generating Human Level Paintings [22.889059874754242]
Generation of stroke-based non-photorealistic imagery is an important problem in the computer vision community.
Previous methods have been limited to datasets with little variation in position, scale and saliency of the foreground object.
We propose a Semantic Guidance pipeline with a bi-level painting procedure for learning the distinction between foreground and background brush strokes at training time.
arXiv Detail & Related papers (2020-11-25T09:00:04Z)
- Monocular, One-stage, Regression of Multiple 3D People [105.3143785498094]
We propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP).
Our method simultaneously predicts a Body Center heatmap and a Mesh map, which can jointly describe the 3D body mesh on the pixel level.
Compared with state-of-the-art methods, ROMP achieves superior performance on challenging multi-person benchmarks.
arXiv Detail & Related papers (2020-08-27T17:21:47Z)
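To make the heatmap-plus-mesh-map idea concrete, here is a schematic sketch; the channel counts and parameter layout are placeholders, not ROMP's exact design. A center heatmap marks likely body centers, and a per-pixel parameter map is read out at the detected centers to give one set of mesh parameters per person:

```python
# Schematic sketch of a center-heatmap + per-pixel parameter-map head,
# illustrating the one-stage idea; channel sizes are placeholders, not ROMP's.
import torch
import torch.nn as nn

class OneStageMeshHead(nn.Module):
    def __init__(self, in_channels=64, param_dim=142):   # param_dim is a placeholder
        super().__init__()
        self.center_head = nn.Conv2d(in_channels, 1, 1)          # body-center heatmap
        self.param_head = nn.Conv2d(in_channels, param_dim, 1)   # per-pixel mesh params
    def forward(self, feat):
        return torch.sigmoid(self.center_head(feat)), self.param_head(feat)

feat = torch.rand(1, 64, 64, 64)                 # backbone features (assumed shape)
heatmap, param_map = OneStageMeshHead()(feat)

# Read mesh parameters at the most confident center locations (greedy, no NMS).
flat = heatmap.flatten(2)                        # (1, 1, H*W)
scores, idx = flat.topk(k=3, dim=2)              # up to 3 people, as an example
ys, xs = idx[0, 0] // 64, idx[0, 0] % 64
people_params = param_map[0, :, ys, xs].t()      # (3, param_dim), one row per person
```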
This list is automatically generated from the titles and abstracts of the papers in this site.