Feature-Supervised Action Modality Transfer
- URL: http://arxiv.org/abs/2108.03329v1
- Date: Fri, 6 Aug 2021 22:59:10 GMT
- Title: Feature-Supervised Action Modality Transfer
- Authors: Fida Mohammad Thoker, Cees G. M. Snoek
- Abstract summary: This paper strives for action recognition and detection in video modalities when only limited modality-specific labeled examples are available.
For the RGB modality, and the optical-flow modality derived from it, many large-scale labeled datasets are available.
- Score: 35.550525307238146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper strives for action recognition and detection in video modalities
like RGB, depth maps or 3D-skeleton sequences when only limited
modality-specific labeled examples are available. For the RGB modality, and the
optical-flow modality derived from it, many large-scale labeled datasets are
available. They have become the de facto pre-training choice when recognizing
or detecting new actions from RGB datasets with only limited labeled
examples. Unfortunately, large-scale labeled action datasets for
other modalities are unavailable for pre-training. In this paper, our goal is
to recognize actions from limited examples in non-RGB video modalities, by
learning from large-scale labeled RGB data. To this end, we propose a two-step
training process: (i) we extract action representation knowledge from an
RGB-trained teacher network and adapt it to a non-RGB student network; (ii) we
then fine-tune the transferred model with the available labeled examples of the
target modality. For the knowledge transfer, we introduce feature-supervision
strategies, which rely on unlabeled pairs of two modalities (the RGB and the
target modality) to transfer feature level representations from the teacher to
the student network. Ablations and generalizations with two RGB source datasets
and two non-RGB target datasets demonstrate that an optical-flow teacher
provides better action transfer features than RGB for both depth maps and
3D-skeletons, even when evaluated on a different target domain, or for a
different task. Compared to alternative cross-modal action transfer methods, we
show a good improvement in performance, especially when labeled non-RGB
examples to learn from are scarce.
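The two-step recipe described in the abstract lends itself to a compact sketch. The snippet below is an illustrative PyTorch version, not the authors' implementation: the backbone, the simple feature-matching MSE loss, and the names FeatureEncoder, transfer_step, and finetune_step are all assumptions. Step (i) distills feature-level representations from a frozen RGB or optical-flow teacher to a target-modality student using unlabeled modality pairs; step (ii) fine-tunes the transferred student on the few labeled target-modality examples.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Toy 3D-conv backbone mapping a video clip of any modality to a feature vector."""
    def __init__(self, in_channels, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.net(x)


def transfer_step(teacher, student, paired_loader, epochs=1, lr=1e-3):
    """Step (i): feature supervision on unlabeled (RGB, target-modality) pairs.
    The frozen teacher, pre-trained on large-scale labeled RGB or optical-flow
    data, provides feature-level targets for the target-modality student."""
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    mse = nn.MSELoss()  # assumed feature-matching loss; other choices are possible
    for _ in range(epochs):
        for rgb_clip, target_clip in paired_loader:  # no action labels required
            with torch.no_grad():
                t_feat = teacher(rgb_clip)           # teacher features
            s_feat = student(target_clip)            # student features
            loss = mse(s_feat, t_feat)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student


def finetune_step(student, head, labeled_loader, epochs=1, lr=1e-4):
    """Step (ii): fine-tune the transferred student plus a classification head
    on the limited labeled examples of the target modality."""
    opt = torch.optim.Adam(list(student.parameters()) + list(head.parameters()), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clip, label in labeled_loader:
            loss = ce(head(student(clip)), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student, head
```

In practice the teacher would be an encoder pre-trained on a large labeled RGB (or derived optical-flow) dataset, the student an encoder for depth maps or 3D skeletons, and head a simple linear classifier over the action classes; the abstract's finding is that an optical-flow teacher transfers better than an RGB one.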
Related papers
- TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking [30.89375068036783]
Existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models.
We propose an Event backbone (Pooler) to obtain a high-quality feature representation that is cognisant of the intrinsic characteristics of the event data.
Our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets.
arXiv Detail & Related papers (2024-05-08T12:19:08Z) - DFormer: Rethinking RGBD Representation Learning for Semantic
Segmentation [76.81628995237058]
DFormer is a novel framework to learn transferable representations for RGB-D segmentation tasks.
It pretrains the backbone using image-depth pairs from ImageNet-1K.
DFormer achieves new state-of-the-art performance on two popular RGB-D tasks.
arXiv Detail & Related papers (2023-09-18T11:09:11Z) - PointMBF: A Multi-scale Bidirectional Fusion Network for Unsupervised
RGB-D Point Cloud Registration [6.030097207369754]
We propose a network implementing multi-scale bidirectional fusion between RGB images and point clouds generated from depth images.
Our method achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-08-09T08:13:46Z) - RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
arXiv Detail & Related papers (2021-09-15T12:31:27Z) - Cross-modality Discrepant Interaction Network for RGB-D Salient Object
Detection [78.47767202232298]
We propose a novel Cross-modality Discrepant Interaction Network (CDINet) for RGB-D SOD.
Two components are designed to implement the effective cross-modality interaction.
Our network outperforms 15 state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-08-04T11:24:42Z) - Self-Supervised Representation Learning for RGB-D Salient Object
Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which enables the network to capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z) - P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for
RGB-D Scene Understanding [24.93545970229774]
We propose contrasting "pairs of point-pixel pairs", where positives include pairs of RGB-D points in correspondence, and negatives include pairs where one of the two modalities has been disturbed.
This provides extra flexibility in making hard negatives and helps networks to learn features from both modalities, not just the more discriminating one of the two.
arXiv Detail & Related papers (2020-12-24T04:00:52Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as one of cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z) - Synergistic saliency and depth prediction for RGB-D saliency detection [76.27406945671379]
Existing RGB-D saliency datasets are small, which may lead to overfitting and limited generalization for diverse scenarios.
We propose a semi-supervised system for RGB-D saliency detection that can be trained on smaller RGB-D saliency datasets without saliency ground truth.
arXiv Detail & Related papers (2020-07-03T14:24:41Z)