In Defense of Clip-based Video Relation Detection
- URL: http://arxiv.org/abs/2307.08984v1
- Date: Tue, 18 Jul 2023 05:42:01 GMT
- Title: In Defense of Clip-based Video Relation Detection
- Authors: Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann
- Abstract summary: Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries.
We propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips.
Our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
- Score: 32.05021939177942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Visual Relation Detection (VidVRD) aims to detect visual relationship
triplets in videos using spatial bounding boxes and temporal boundaries.
Existing VidVRD methods can be broadly categorized into bottom-up and top-down
paradigms, depending on their approach to classifying relations. Bottom-up
methods follow a clip-based approach where they classify relations of short
clip tubelet pairs and then merge them into long video relations. On the other
hand, top-down methods directly classify long video tubelet pairs. While recent
video-based methods utilizing video tubelets have shown promising results, we
argue that the effective modeling of spatial and temporal context plays a more
significant role than the choice between clip tubelets and video tubelets. This
motivates us to revisit the clip-based paradigm and explore the key success
factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM)
that enriches the object-based spatial context and relation-based temporal
context based on clips. We demonstrate that using clip tubelets can achieve
superior performance compared to most video-based methods. Additionally, using
clip tubelets offers more flexibility in model designs and helps alleviate the
limitations associated with video tubelets, such as the challenging long-term
object tracking problem and the loss of temporal information in long-term
tubelet feature compression. Extensive experiments conducted on two challenging
VidVRD benchmarks validate that our HCM achieves a new state-of-the-art
performance, highlighting the effectiveness of incorporating advanced spatial
and temporal context modeling within the clip-based paradigm.
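The following is a minimal, hypothetical sketch of the bottom-up (clip-based) paradigm the abstract describes: relations are first classified on short clip tubelet pairs and then merged into long video-level relation instances. The `ClipRelation` fields, identity-based matching, and greedy merging rule are illustrative assumptions, not the paper's actual HCM implementation.

```python
# Hypothetical sketch of clip-based VidVRD post-processing: merge per-clip relation
# triplets into video-level relations. Field names and merging rule are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class ClipRelation:
    subject_id: int      # tracked identity of the subject tubelet
    object_id: int       # tracked identity of the object tubelet
    predicate: str       # relation label predicted for this clip, e.g. "ride"
    start_frame: int     # temporal extent of the clip
    end_frame: int
    score: float         # classification confidence for this clip-level triplet


def merge_clip_relations(clip_relations: List[ClipRelation]) -> List[ClipRelation]:
    """Greedily link temporally adjacent clip-level triplets that share the same
    (subject, predicate, object) into longer video-level relation instances."""
    # Sort by triplet identity first, then by time, so mergeable segments are adjacent.
    ordered = sorted(
        clip_relations,
        key=lambda r: (r.subject_id, r.object_id, r.predicate, r.start_frame),
    )
    merged: List[ClipRelation] = []
    for rel in ordered:
        last = merged[-1] if merged else None
        same_triplet = (
            last is not None
            and last.subject_id == rel.subject_id
            and last.object_id == rel.object_id
            and last.predicate == rel.predicate
        )
        # Merge when clips overlap or touch in time; otherwise start a new instance.
        if same_triplet and rel.start_frame <= last.end_frame + 1:
            last.end_frame = max(last.end_frame, rel.end_frame)
            last.score = max(last.score, rel.score)  # keep the strongest clip score
        else:
            merged.append(ClipRelation(**vars(rel)))
    return merged
```

In practice, clip-based methods typically decide whether two clip tubelets belong to the same object by tubelet overlap (e.g., volume IoU) rather than the integer identities assumed here; the sketch only illustrates the classify-then-merge structure of the bottom-up paradigm.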
Related papers
- HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions [59.71751978599567]
This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process.
We demonstrate significant improvements in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in clicks required for annotating over 12 hours of video.
arXiv Detail & Related papers (2024-09-16T18:15:38Z) - Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z) - FODVid: Flow-guided Object Discovery in Videos [12.792602427704395]
We focus on building a generalizable solution that avoids overfitting to the intricacies of individual videos.
To solve Video Object Segmentation (VOS) in an unsupervised setting, we propose a new pipeline (FODVid) based on the idea of guiding segmentation outputs.
arXiv Detail & Related papers (2023-07-10T07:55:42Z) - Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moire patterns, appearing as color distortions, severely degrade image and video qualities when filming a screen with digital cameras.
We study how to remove such undesirable moire patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z) - Controllable Augmentations for Video Representation Learning [34.79719112810065]
We propose a framework to jointly utilize local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework is superior on three video benchmarks in action recognition and video retrieval, capturing more accurate temporal dynamics.
arXiv Detail & Related papers (2022-03-30T19:34:32Z) - Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z) - Long Short-Term Relation Networks for Video Action Detection [155.13392337831166]
Long Short-Term Relation Networks (LSTR) are presented in this paper.
LSTR aggregates and propagates relations to augment features for video action detection.
Extensive experiments are conducted on four benchmark datasets.
arXiv Detail & Related papers (2020-03-31T10:02:51Z)