Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action
Segmentation in Videos
- URL: http://arxiv.org/abs/2209.05653v5
- Date: Tue, 6 Feb 2024 11:12:02 GMT
- Title: Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action
Segmentation in Videos
- Authors: Junbin Zhang, Pei-Hsuan Tsai and Meng-Hsun Tsai
- Abstract summary: This study introduces a graph-structured approach named Semantic2Graph, to model long-term dependencies in videos.
We have designed positive and negative semantic edges, accompanied by corresponding edge weights, to capture both long-term and short-term semantic relationships in video actions.
- Score: 0.40778318140713216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video action segmentation has been widely applied in many fields. Most
previous studies employed video-based vision models for this purpose. However,
they often rely on a large receptive field, LSTM, or Transformer methods to
capture long-term dependencies within videos, leading to significant
computational resource requirements. To address this challenge, graph-based
models were proposed. However, previous graph-based models are less accurate.
Hence, this study introduces a graph-structured approach named Semantic2Graph
to model long-term dependencies in videos, thereby reducing computational costs
and raising accuracy. We construct a frame-level graph structure of the
video. Temporal edges are utilized to model the temporal relations and
frame-level. Temporal edges are utilized to model the temporal relations and
action order within videos. Additionally, we have designed positive and
negative semantic edges, accompanied by corresponding edge weights, to capture
both long-term and short-term semantic relationships in video actions. Node
attributes encompass a rich set of multi-modal features extracted from video
content, graph structures, and label text, covering visual, structural, and
semantic cues. To synthesize this multi-modal information effectively, we
employ a graph neural network (GNN) model to fuse multi-modal features for node
action label classification. Experimental results demonstrate that
Semantic2Graph outperforms state-of-the-art methods in terms of performance,
particularly on benchmark datasets such as GTEA and 50Salads. Multiple ablation
experiments further validate the effectiveness of semantic features in
enhancing model performance. Notably, the inclusion of semantic edges in
Semantic2Graph allows for the cost-effective capture of long-term dependencies,
affirming its utility in addressing the challenges posed by computational
resource constraints in video-based vision models.
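
The graph construction described above (frame-level nodes, temporal edges between consecutive frames, and positive/negative semantic edges weighted by label agreement, fused by a GNN) can be illustrated with a minimal sketch. All function names, edge weights, and the mean-aggregation layer below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of Semantic2Graph-style graph construction.
import numpy as np

def build_video_graph(labels, w_temporal=1.0, w_pos=0.5, w_neg=-0.5):
    """Weighted adjacency over frame nodes: temporal edges link consecutive
    frames; semantic edges link non-adjacent frames, positive when action
    labels agree and negative otherwise (weights are assumed values)."""
    n = len(labels)
    adj = np.zeros((n, n))
    for i in range(n - 1):                # temporal edges: frame order
        adj[i, i + 1] = adj[i + 1, i] = w_temporal
    for i in range(n):
        for j in range(i + 2, n):         # skip pairs already linked temporally
            w = w_pos if labels[i] == labels[j] else w_neg
            adj[i, j] = adj[j, i] = w     # long/short-term semantic edges
    return adj

def gnn_layer(adj, feats):
    """One mean-aggregation message-passing step, a stand-in for the
    multi-modal fusion GNN used for node label classification."""
    deg = np.abs(adj).sum(axis=1, keepdims=True) + 1e-8
    return np.maximum((adj / deg) @ feats, 0)   # normalize, propagate, ReLU

labels = [0, 0, 1, 1, 1, 2]               # toy per-frame action labels
adj = build_video_graph(labels)
feats = np.eye(len(labels))               # toy one-hot node features
out = gnn_layer(adj, feats)               # fused node representations
```

Note that the semantic edges give every frame a direct path to distant frames with the same action, which is how long-term dependencies are captured without a large receptive field or attention over all frame pairs.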
Related papers
- Multi-Scene Generalized Trajectory Global Graph Solver with Composite
Nodes for Multiple Object Tracking [61.69892497726235]
Composite Node Message Passing Network (CoNo-Link) is a framework for modeling ultra-long frames information for association.
In addition to the previous method of treating objects as nodes, the network innovatively treats object trajectories as nodes for information interaction.
Our model can learn better predictions on longer-time scales by adding composite nodes.
arXiv Detail & Related papers (2023-12-14T14:00:30Z) - Multi-Task Edge Prediction in Temporally-Dynamic Video Graphs [16.121140184388786]
We propose MTD-GNN, a graph network for predicting temporally-dynamic edges for multiple types of relations.
We show that modeling multiple relations in our temporal-dynamic graph network can be mutually beneficial.
arXiv Detail & Related papers (2022-12-06T10:41:00Z) - MGNNI: Multiscale Graph Neural Networks with Implicit Layers [53.75421430520501]
Implicit graph neural networks (GNNs) have been proposed to capture long-range dependencies in underlying graphs.
We introduce and justify two weaknesses of implicit GNNs: the constrained expressiveness due to their limited effective range for capturing long-range dependencies, and their lack of ability to capture multiscale information on graphs at multiple resolutions.
We propose a multiscale graph neural network with implicit layers (MGNNI) which is able to model multiscale structures on graphs and has an expanded effective range for capturing long-range dependencies.
arXiv Detail & Related papers (2022-10-15T18:18:55Z) - Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z) - Adaptive graph convolutional networks for weakly supervised anomaly
detection in videos [42.3118758940767]
We propose a weakly supervised adaptive graph convolutional network (WAGCN) to model the contextual relationships among video segments.
We fully consider the influence of other video segments on the current segment when generating the anomaly probability score for each segment.
arXiv Detail & Related papers (2022-02-14T06:31:34Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video
Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL)
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter- snippet Temporal Contrastive Graphs (TCG)
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Temporal Relational Modeling with Self-Supervision for Action
Segmentation [38.62057004624234]
We introduce Dilated Temporal Graph Reasoning Module (DTGRM) to model temporal relations in video.
In particular, we capture and model temporal relations via constructing multi-level dilated temporal graphs.
Our model outperforms state-of-the-art action segmentation models on three challenging datasets.
arXiv Detail & Related papers (2020-12-14T13:41:28Z) - Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.