Long Short-Term Relation Networks for Video Action Detection
- URL: http://arxiv.org/abs/2003.14065v1
- Date: Tue, 31 Mar 2020 10:02:51 GMT
- Title: Long Short-Term Relation Networks for Video Action Detection
- Authors: Dong Li and Ting Yao and Zhaofan Qiu and Houqiang Li and Tao Mei
- Abstract summary: Long Short-Term Relation Networks (LSTR) are presented in this paper.
LSTR aggregates and propagates relations to augment features for video action detection.
Extensive experiments are conducted on four benchmark datasets.
- Score: 155.13392337831166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has been well recognized that modeling human-object or object-object
relations would be helpful for the detection task. Nevertheless, the problem is not
trivial, especially when exploring the interactions between the human actor, objects,
and the scene (collectively, the human-context) to boost video action detectors. The
difficulty stems from the fact that reliable relations in a video should depend
not only on the short-term human-context relations in the present clip but also
on the temporal dynamics distilled over a long-range span of the video. This
motivates us to capture both short-term and long-term relations in a video. In
this paper, we present Long Short-Term Relation Networks, dubbed LSTR, which
aggregate and propagate relations to augment features for video action
detection. Technically, a Region Proposal Network (RPN) is remoulded to
first produce 3D bounding boxes, i.e., tubelets, in each video clip. LSTR then
models short-term human-context interactions within each clip through
spatio-temporal attention mechanism and reasons long-term temporal dynamics
across video clips via Graph Convolutional Networks (GCN) in a cascaded manner.
Extensive experiments are conducted on four benchmark datasets, and superior
results are reported when compared to state-of-the-art methods.
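To make the cascaded design concrete, below is a minimal sketch of the idea described in the abstract: short-term human-context attention within a clip, followed by graph-based reasoning across clips. It is an illustrative assumption rather than the authors' implementation; the feature dimensions, the single attention layer, and the similarity-based clip graph are all placeholders.

```python
# Hedged sketch of the cascaded short-term / long-term relation idea.
# Tubelet features, layer choices, and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class ShortTermRelation(nn.Module):
    """Attends actor tubelet features to object/scene context within one clip."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, actors: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # actors:  (B, Na, D) hypothetical tubelet features of actor proposals
        # context: (B, Nc, D) hypothetical features of object/scene regions
        rel, _ = self.attn(actors, context, context)  # actors query the context
        return self.norm(actors + rel)                # residual feature augmentation


class LongTermRelation(nn.Module):
    """Propagates clip-level features across a long span with one graph step."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (B, T, D) one augmented feature per clip over the video
        affinity = torch.softmax(clip_feats @ clip_feats.transpose(1, 2), dim=-1)
        return torch.relu(clip_feats + affinity @ self.proj(clip_feats))


if __name__ == "__main__":
    B, T, Na, Nc, D = 2, 8, 3, 10, 256
    short_term = ShortTermRelation(D)
    long_term = LongTermRelation(D)
    actors = torch.randn(B * T, Na, D)
    context = torch.randn(B * T, Nc, D)
    per_clip = short_term(actors, context).mean(dim=1).view(B, T, D)  # short-term first
    print(long_term(per_clip).shape)  # then long-term: torch.Size([2, 8, 256])
```

In the paper, the tubelet features come from a remoulded RPN and the long-term step uses Graph Convolutional Networks; the affinity graph above merely stands in for that stage.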
Related papers
- How Much Temporal Long-Term Context is Needed for Action Segmentation? [16.89998201009075]
We introduce a transformer-based model that leverages sparse attention to capture the full context of a video.
Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
arXiv Detail & Related papers (2023-08-22T11:20:40Z) - In Defense of Clip-based Video Relation Detection [32.05021939177942]
Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries.
We propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips.
Our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
arXiv Detail & Related papers (2023-07-18T05:42:01Z) - FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial
Video Classification [49.06447472006251]
We propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification.
Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves the state-of-the-art results.
arXiv Detail & Related papers (2022-09-22T21:15:58Z) - Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Spatio-Temporal Interaction Graph Parsing Networks for Human-Object
Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial locations and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z) - What and When to Look?: Temporal Span Proposal Network for Video Visual
Relation Detection [4.726777092009554]
Existing approaches to Video Visual Relation Detection (VidVRD) fall into two categories: segment-based and window-based.
We first point out the limitations of these two approaches and propose Temporal Span Proposal Network (TSPN), a novel method with advantages in both efficiency and effectiveness.
arXiv Detail & Related papers (2021-07-15T07:01:26Z) - ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction
Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI).
We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z) - Temporal Relational Modeling with Self-Supervision for Action
Segmentation [38.62057004624234]
We introduce a Dilated Temporal Graph Reasoning Module (DTGRM) to model temporal relations in video.
In particular, we capture and model temporal relations via constructing multi-level dilated temporal graphs.
Our model outperforms state-of-the-art action segmentation models on three challenging datasets.
arXiv Detail & Related papers (2020-12-14T13:41:28Z)
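As a quick illustration of the multi-level dilated temporal graphs mentioned in the DTGRM entry above, the sketch below builds one adjacency matrix per dilation rate and averages neighbouring frame features. The dilation rates, normalisation, and message-passing step are assumptions for illustration, not the paper's formulation.

```python
# Illustrative sketch of multi-level dilated temporal graphs over per-frame
# features; dilation rates and the propagation step are assumptions.
import torch


def dilated_temporal_adjacency(num_frames: int, dilation: int) -> torch.Tensor:
    """Adjacency connecting each frame to neighbours `dilation` steps away."""
    adj = torch.zeros(num_frames, num_frames)
    idx = torch.arange(num_frames)
    for offset in (-dilation, dilation):
        j = idx + offset
        valid = (j >= 0) & (j < num_frames)
        adj[idx[valid], j[valid]] = 1.0
    return adj


def multi_level_reasoning(frame_feats: torch.Tensor, dilations=(1, 2, 4)) -> torch.Tensor:
    # frame_feats: (T, D) per-frame features of one video
    out = frame_feats
    for d in dilations:                        # one graph per dilation level
        adj = dilated_temporal_adjacency(out.shape[0], d)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        out = out + (adj / deg) @ out          # normalised neighbour averaging
    return out


if __name__ == "__main__":
    print(multi_level_reasoning(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```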
This list is automatically generated from the titles and abstracts of the papers on this site.