Temporal Relational Modeling with Self-Supervision for Action Segmentation
- URL: http://arxiv.org/abs/2012.07508v1
- Date: Mon, 14 Dec 2020 13:41:28 GMT
- Title: Temporal Relational Modeling with Self-Supervision for Action Segmentation
- Authors: Dong Wang, Di Hu, Xingjian Li, Dejing Dou
- Abstract summary: We introduce the Dilated Temporal Graph Reasoning Module (DTGRM) to model temporal relations in videos.
In particular, we capture and model temporal relations via constructing multi-level dilated temporal graphs.
Our model outperforms state-of-the-art action segmentation models on three challenging datasets.
- Score: 38.62057004624234
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Temporal relational modeling in video is essential for human action
understanding, such as action recognition and action segmentation. Although
Graph Convolution Networks (GCNs) have shown promising advantages in relation
reasoning on many tasks, effectively applying graph convolution networks to
long video sequences remains a challenge. The main reason is that the large
number of nodes (i.e., video frames) makes it hard for GCNs to capture and
model temporal relations in videos. To tackle this problem, in this paper, we
introduce an effective GCN module, Dilated Temporal Graph Reasoning Module
(DTGRM), designed to model temporal relations and dependencies between video
frames at various time spans. In particular, we capture and model temporal
relations via constructing multi-level dilated temporal graphs where the nodes
represent frames from different moments in the video. Moreover, to enhance the
temporal reasoning ability of the proposed model, an auxiliary self-supervised
task is proposed to encourage the dilated temporal graph reasoning module to find and
correct wrong temporal relations in videos. Our DTGRM model outperforms
state-of-the-art action segmentation models on three challenging datasets:
50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
The code is available at https://github.com/redwang/DTGRM.
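As a concrete illustration of the two ideas in the abstract, below is a minimal PyTorch sketch of (a) a dilated temporal graph over frame nodes, stacked at several dilations to form multi-level graphs, and (b) a frame-corruption routine in the spirit of the auxiliary self-supervised task. All names here (dilated_adjacency, DilatedTemporalGraphLayer, corrupt_frames) and the plain graph-convolution formulation are illustrative assumptions, not the authors' implementation; the actual DTGRM code is at the GitHub link above.

    # Illustrative sketch only; not the authors' DTGRM implementation.
    import torch

    def dilated_adjacency(num_frames: int, dilation: int) -> torch.Tensor:
        """Adjacency for one dilated temporal graph: frame t is linked to
        frames t - dilation and t + dilation, plus a self-loop."""
        adj = torch.eye(num_frames)
        idx = torch.arange(max(num_frames - dilation, 0))
        adj[idx, idx + dilation] = 1.0  # forward dilated edges
        adj[idx + dilation, idx] = 1.0  # backward dilated edges
        return adj / adj.sum(dim=1, keepdim=True)  # row-normalize

    class DilatedTemporalGraphLayer(torch.nn.Module):
        """One reasoning level: a plain graph convolution
        X' = ReLU(A_d X W) over frame-wise features, where A_d is built
        at this layer's dilation."""
        def __init__(self, in_dim: int, out_dim: int, dilation: int):
            super().__init__()
            self.dilation = dilation
            self.proj = torch.nn.Linear(in_dim, out_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (num_frames, in_dim) features for a single video.
            adj = dilated_adjacency(x.size(0), self.dilation).to(x.device)
            return torch.relu(adj @ self.proj(x))

    # Multi-level graphs: stack levels with growing dilations so edges
    # span progressively longer time ranges (local to long-range).
    levels = torch.nn.Sequential(
        *[DilatedTemporalGraphLayer(64, 64, d) for d in (1, 2, 4, 8)]
    )

    def corrupt_frames(x: torch.Tensor, num_swaps: int = 4):
        """Self-supervision sketch: swap random pairs of frame features
        and return per-frame labels marking the corrupted positions, so
        the graph module can be trained to spot wrong temporal relations."""
        x, labels = x.clone(), torch.zeros(x.size(0))
        for _ in range(num_swaps):
            i, j = torch.randint(0, x.size(0), (2,)).tolist()
            x[[i, j]] = x[[j, i]]  # swap the two frames' features
            labels[i] = labels[j] = 1.0
        return x, labels

In this sketch the corruption labels would be predicted by the graph module and trained jointly with the segmentation loss; the paper's own auxiliary task likewise asks the module to find and correct wrong temporal relations.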
Related papers
- SelfGNN: Self-Supervised Graph Neural Networks for Sequential Recommendation [15.977789295203976]
We propose a novel framework called Self-Supervised Graph Neural Network (SelfGNN) for sequential recommendation.
The SelfGNN framework encodes short-term graphs based on time intervals and utilizes Graph Neural Networks (GNNs) to learn short-term collaborative relationships.
Our personalized self-augmented learning structure enhances model robustness by mitigating noise in short-term graphs based on long-term user interests and personal stability.
arXiv Detail & Related papers (2024-05-31T14:53:12Z) - Local-Global Information Interaction Debiasing for Dynamic Scene Graph
Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces the local interaction information and global human-action interaction information.
Long-term human actions supervise the model to generate multiple scene graphs that conform to the global constraints, preventing it from failing to learn the tail predicates.
arXiv Detail & Related papers (2023-08-10T01:24:25Z) - Multi-Task Edge Prediction in Temporally-Dynamic Video Graphs [16.121140184388786]
We propose MTD-GNN, a graph network for predicting temporally-dynamic edges for multiple types of relations.
We show that modeling multiple relations in our temporal-dynamic graph network can be mutually beneficial.
arXiv Detail & Related papers (2022-12-06T10:41:00Z) - Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z) - Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action
Segmentation in Videos [0.40778318140713216]
This study introduces a graph-structured approach named Semantic2Graph to model long-term dependencies in videos.
We have designed positive and negative semantic edges, accompanied by corresponding edge weights, to capture both long-term and short-term semantic relationships in video actions.
arXiv Detail & Related papers (2022-09-13T00:01:23Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video
Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Temporal Contrastive Graph Learning for Video Action Recognition and
Retrieval [83.56444443849679]
This work takes advantage of the temporal dependencies within videos and proposes a novel self-supervised method named Temporal Contrastive Graph Learning (TCGL).
Our TCGL is rooted in a hybrid graph contrastive learning strategy that jointly regards the inter-snippet and intra-snippet temporal dependencies as self-supervision signals for temporal representation learning.
Experimental results demonstrate the superiority of our TCGL over the state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
arXiv Detail & Related papers (2021-01-04T08:11:39Z) - Temporal Graph Modeling for Skeleton-based Action Recognition [25.788239844759246]
We propose a Temporal Enhanced Graph Convolutional Network (TE-GCN) to capture complex temporal dynamics.
The constructed temporal relation graph explicitly builds connections between semantically related temporal features.
Experiments are performed on two widely used large-scale datasets.
arXiv Detail & Related papers (2020-12-16T09:02:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.