Related papers: Trunk-branch Contrastive Network with Multi-view Deformable Aggregation for Multi-view Action Recognition

URL: http://arxiv.org/abs/2502.16493v1
Date: Sun, 23 Feb 2025 08:10:20 GMT
Title: Trunk-branch Contrastive Network with Multi-view Deformable Aggregation for Multi-view Action Recognition
Authors: Yingyuan Yang, Guoyuan Liang, Can Wang, Xiaojun Wu
Abstract summary: Multi-view action recognition aims to identify actions in a given multi-view scene. We propose a novel trunk-branch contrastive network (TBCNet) for RGB-based multi-view action recognition.
Score: 8.99769677768336
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-view action recognition aims to identify actions in a given multi-view scene. Traditional studies first extract refined features from each view and then perform pairwise interaction and integration, but they potentially overlook the critical local features in each view. When observing objects from multiple perspectives, individuals typically form a comprehensive impression and subsequently fill in specific details. Drawing inspiration from this cognitive process, we propose a novel trunk-branch contrastive network (TBCNet) for RGB-based multi-view action recognition. Distinctively, TBCNet first obtains fused features in the trunk block and then implicitly supplements vital details provided by the branch block via contrastive learning, generating a more informative and comprehensive action representation. Within this framework, we construct two core components: multi-view deformable aggregation (MVDA) and trunk-branch contrastive learning. MVDA, employed in the trunk block, facilitates multi-view feature fusion and adaptive cross-view spatio-temporal correlation, where a global aggregation module is utilized to emphasize significant spatial information and a composite relative position bias is designed to capture the intra- and cross-view relative positions. Moreover, a trunk-branch contrastive loss is constructed between aggregated features and refined details from each view. By incorporating two distinct weights for positive and negative samples, a weighted trunk-branch contrastive loss is proposed to extract valuable information and emphasize subtle inter-class differences. The effectiveness of TBCNet is verified by extensive experiments on four datasets: NTU-RGB+D 60, NTU-RGB+D 120, PKU-MMD, and N-UCLA. Compared to other RGB-based methods, our approach achieves state-of-the-art performance under the cross-subject and cross-setting protocols.
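Illustrative sketch (not from the paper): the abstract describes a contrastive loss between aggregated trunk features and per-view branch features, with separate weights for positive and negative samples, but it does not give the formula. The code below is one generic way such a loss could look, assuming an InfoNCE-style form with cosine similarity, where branch features sharing the anchor's action label count as positives. The function names, the parameters w_pos, w_neg, and tau, and the exact weighting scheme are all assumptions made for illustration; TBCNet's actual loss may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_contrastive_loss(trunk, branch, trunk_labels, branch_labels,
                              w_pos=1.0, w_neg=1.0, tau=0.1):
    """Hypothetical weighted InfoNCE-style loss (an assumption, not the paper's).

    trunk:  fused feature vectors, one per sample (the anchors).
    branch: per-view feature vectors, compared against every anchor.
    A branch feature is a positive for an anchor if their labels match;
    w_pos and w_neg scale the positive and negative terms separately.
    """
    losses = []
    for z, y in zip(trunk, trunk_labels):
        pos, neg = 0.0, 0.0
        for b, yb in zip(branch, branch_labels):
            e = math.exp(cosine(z, b) / tau)
            if yb == y:
                pos += w_pos * e
            else:
                neg += w_neg * e
        if pos > 0.0:  # skip anchors that have no positive branch feature
            losses.append(-math.log(pos / (pos + neg)))
    if not losses:
        raise ValueError("no anchor has a positive branch feature")
    return sum(losses) / len(losses)


# Toy usage: two classes, two branch views per class.
trunk = [[1.0, 0.0], [0.0, 1.0]]
branch = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.0], [0.0, 0.8]]
loss = weighted_contrastive_loss(trunk, branch, [0, 1], [0, 1, 0, 1],
                                 w_pos=1.0, w_neg=2.0)
```

In this sketch, raising w_neg relative to w_pos penalizes similarity to other classes more strongly, which is one plausible reading of "emphasize subtle inter-class differences".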

Related papers

Enhancing Semi-Supervised Multi-View Graph Convolutional Networks via Supervised Contrastive Learning and Self-Training [9.300953069946969]
Graph convolutional network (GCN)-based multi-view learning provides a powerful framework for integrating structural information from heterogeneous views. Existing methods often fail to fully exploit the complementary information across views, leading to suboptimal feature representations and limited performance. We propose MV-SupGCN, a semi-supervised GCN model that integrates several complementary components with clear motivations and mutual reinforcement.
arXiv Detail & Related papers (2025-12-15T16:39:23Z)
Enhancing Graph Contrastive Learning with Reliable and Informative Augmentation for Recommendation [84.45144851024257]
We propose a novel framework that aims to enhance graph contrastive learning by constructing contrastive views with stronger collaborative information via discrete codes. The core idea is to map users and items into discrete codes rich in collaborative information for reliable and informative contrastive view generation.
arXiv Detail & Related papers (2024-09-09T14:04:17Z)
Asymmetric double-winged multi-view clustering network for exploring Diverse and Consistent Information [28.300395619444796]
In unsupervised scenarios, deep contrastive multi-view clustering (DCMVC) is becoming a research hotspot. We propose a novel multi-view clustering network termed CodingNet to explore the diverse and consistent information simultaneously. Our framework's efficacy is validated through extensive experiments on six widely used benchmark datasets.
arXiv Detail & Related papers (2023-09-01T14:13:22Z)
DealMVC: Dual Contrastive Calibration for Multi-view Clustering [78.54355167448614]
We propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). We first design a fusion mechanism to obtain a global cross-view feature. Then, a global contrastive calibration loss is proposed by aligning the view feature similarity graph and the high-confidence pseudo-label graph. During the training procedure, the interacted cross-view feature is jointly optimized at both local and global levels.
arXiv Detail & Related papers (2023-08-17T14:14:28Z)
M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition. It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making. Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition [45.0131792009999]
We propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition. Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information. Our network outperforms state-of-the-art approaches in most standard evaluation settings.
arXiv Detail & Related papers (2023-07-22T03:51:32Z)
Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph. We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
Mutual Information Regularization for Weakly-supervised RGB-D Salient Object Detection [33.210575826086654]
We present a weakly-supervised RGB-D salient object detection model via supervision. We focus on effective multimodal representation learning via inter-modal mutual information regularization.
arXiv Detail & Related papers (2023-06-06T12:36:57Z)
Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose a Dual Swin-Transformer based Mutual Interactive Network (DTMINet). We adopt Swin-Transformer as the feature extractor for both RGB and depth modalities to model the long-range dependencies in visual inputs. Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection. Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps. Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z)
RGBT Tracking via Multi-Adapter Network with Hierarchical Divergence Loss [37.99375824040946]
We propose a novel multi-adapter network to jointly perform modality-shared, modality-specific and instance-aware target representation learning. Experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker.
arXiv Detail & Related papers (2020-11-14T01:50:46Z)
Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips. In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA) module. Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.