Video Relation Detection via Tracklet based Visual Transformer
- URL: http://arxiv.org/abs/2108.08669v1
- Date: Thu, 19 Aug 2021 13:13:23 GMT
- Title: Video Relation Detection via Tracklet based Visual Transformer
- Authors: Kaifeng Gao, Long Chen, Yifeng Huang, Jun Xiao
- Abstract summary: Video Visual Relation Detection (VidVRD) has received significant attention from the community in recent years.
We apply the state-of-the-art video object tracklet detection pipeline MEGA and deepSORT to generate tracklet proposals.
Then we perform VidVRD in a tracklet-based manner without any pre-cutting operations.
- Score: 12.31184296559801
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video Visual Relation Detection (VidVRD) has received significant attention
from the community in recent years. In this paper, we apply the
state-of-the-art video object tracklet detection pipeline MEGA and deepSORT to
generate tracklet proposals. Then we perform VidVRD in a tracklet-based manner
without any pre-cutting operations. Specifically, we design a tracklet-based
visual Transformer. It contains a temporal-aware decoder which performs feature
interactions between the tracklets and learnable predicate query embeddings,
and finally predicts the relations. Experimental results strongly demonstrate
the superiority of our method, which outperforms other methods by a large
margin on the Video Relation Understanding (VRU) Grand Challenge in ACM
Multimedia 2021. Codes are released at
https://github.com/Dawn-LX/VidVRD-tracklets.
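To make the pipeline concrete, below is a minimal PyTorch sketch of a Transformer decoder in which learnable predicate query embeddings cross-attend to tracklet features and are classified into relations, following the abstract's description. All module names, dimensions, and the omission of the temporal-aware positional encoding are illustrative assumptions; the authors' actual implementation is in the repository linked above.

```python
# Minimal sketch of a tracklet-based relation decoder with learnable
# predicate queries. Illustrative only; not the released VidVRD-tracklets code.
import torch
import torch.nn as nn

class TrackletRelationDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_predicates=132,
                 num_layers=6, nhead=8):
        super().__init__()
        # Learnable predicate query embeddings, one per candidate relation slot.
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Classify each decoded query into a predicate class (+1 for "no relation").
        self.predicate_head = nn.Linear(d_model, num_predicates + 1)

    def forward(self, tracklet_feats):
        # tracklet_feats: (batch, num_tracklets, d_model) features of the
        # tracklet proposals produced by the detection/tracking pipeline
        # (MEGA + deepSORT in the paper).
        b = tracklet_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        # Cross-attention between predicate queries and tracklet features.
        decoded = self.decoder(q, tracklet_feats)
        return self.predicate_head(decoded)  # (batch, num_queries, classes)

model = TrackletRelationDecoder()
logits = model(torch.randn(2, 30, 256))  # 2 videos, 30 tracklets each
print(logits.shape)  # torch.Size([2, 100, 133])
```

In a full system the tracklet features would come from the MEGA + deepSORT proposals, and grounding each predicted predicate to its subject and object tracklets would be handled by additional heads.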
Related papers
- VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
Open-vocabulary multi-object tracking combines the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z) - TrackGo: A Flexible and Efficient Method for Controllable Video Generation [32.906496577618924]
We introduce TrackGo, a novel approach for conditional video generation.
TrackGo offers users a flexible and precise mechanism for manipulating video content.
We also propose the TrackAdapter for control implementation.
arXiv Detail & Related papers (2024-08-21T09:42:04Z) - AViTMP: A Tracking-Specific Transformer for Single-Branch Visual Tracking [17.133735660335343]
We propose an Adaptive ViT Model Prediction tracker (AViTMP) to design a customised tracking method.
This method bridges the single-branch network with discriminative models for the first time.
We show that AViTMP achieves state-of-the-art performance, especially in terms of long-term tracking and robustness.
arXiv Detail & Related papers (2023-10-30T13:48:04Z) - Tracking by Associating Clips [110.08925274049409]
In this paper, we investigate an alternative by treating object association as clip-wise matching.
Our new perspective views a single long video sequence as multiple short clips, and then the tracking is performed both within and between the clips.
The benefits of this new approach are twofold. First, our method is robust to tracking error accumulation and propagation, as chunking the video allows interrupted frames to be bypassed.
Second, multi-frame information is aggregated during clip-wise matching, yielding more accurate long-range track association than current frame-wise matching; a toy sketch of this clip-wise linking follows.
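As an illustration only: each track is summarized by an embedding aggregated over its clip, and adjacent clips are linked by matching those embeddings with the Hungarian algorithm. The cosine-distance cost and threshold are assumptions, not the paper's exact design.

```python
# Toy sketch of clip-wise track association between two adjacent clips.
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_clips(prev_tracks, next_tracks):
    # prev_tracks, next_tracks: (N, D) and (M, D) arrays, each row the
    # embedding of one track aggregated over all frames of its clip.
    prev = prev_tracks / np.linalg.norm(prev_tracks, axis=1, keepdims=True)
    nxt = next_tracks / np.linalg.norm(next_tracks, axis=1, keepdims=True)
    cost = 1.0 - prev @ nxt.T           # cosine distance between tracks
    rows, cols = linear_sum_assignment(cost)
    # Keep only confident matches; unmatched tracks start or end here.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 0.5]

matches = link_clips(np.random.rand(5, 128), np.random.rand(6, 128))
```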
arXiv Detail & Related papers (2022-12-20T10:33:17Z) - It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - End-to-End Referring Video Object Segmentation with Multimodal Transformers [0.0]
We propose a simple Transformer-based approach to the referring video object segmentation task.
Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem.
MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps.
arXiv Detail & Related papers (2021-11-29T18:59:32Z) - Split and Connect: A Universal Tracklet Booster for Multi-Object Tracking [33.23825397557663]
Multi-object tracking (MOT) is an essential task in the computer vision field.
In this paper, a tracklet booster algorithm is proposed, which can be built upon any other tracker.
The motivation is simple and straightforward: split tracklets at potential ID-switch positions and then connect multiple tracklets into one if they come from the same object; a toy sketch of this idea follows.
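As an illustration only: cut a tracklet where consecutive appearance embeddings diverge (a likely ID switch), then greedily merge tracklets whose mean embeddings agree. The embedding-based cues and thresholds are assumptions, not the paper's algorithm.

```python
# Toy sketch of the split-and-connect tracklet booster idea.
import numpy as np

def split_tracklet(embs, jump_thresh=0.6):
    # embs: (T, D) per-frame appearance embeddings of one tracklet.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = np.sum(embs[:-1] * embs[1:], axis=1)   # frame-to-frame cosine
    cuts = np.where(sims < jump_thresh)[0] + 1    # potential ID switches
    return np.split(embs, cuts)

def connect(tracklets, merge_thresh=0.8):
    # Greedily merge consecutive tracklets whose mean embeddings agree.
    merged = [tracklets[0]]
    for t in tracklets[1:]:
        a, b = merged[-1].mean(0), t.mean(0)
        if a @ b / (np.linalg.norm(a) * np.linalg.norm(b)) > merge_thresh:
            merged[-1] = np.concatenate([merged[-1], t])
        else:
            merged.append(t)
    return merged

pieces = split_tracklet(np.random.rand(50, 64))
tracks = connect(pieces)
```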
arXiv Detail & Related papers (2021-05-06T03:49:19Z) - Video Transformer Network [0.0]
This paper presents a transformer-based framework for video recognition.
Inspired by recent developments in vision transformers, we ditch the standard approach in video action recognition that relies on 3D ConvNets.
Our approach is generic and builds on top of any given 2D spatial network.
arXiv Detail & Related papers (2021-02-01T09:29:10Z) - TrackFormer: Multi-Object Tracking with Transformers [92.25832593088421]
TrackFormer is an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture.
New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time.
TrackFormer achieves a seamless data association between frames in a new tracking-by-attention paradigm.
arXiv Detail & Related papers (2021-01-07T18:59:29Z) - TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model [51.14840210957289]
Multi-object tracking is a fundamental vision problem that has been studied for a long time.
Despite the success of Tracking by Detection (TBD), this two-step method is too complicated to train in an end-to-end manner.
We propose a concise end-to-end model, TubeTK, which needs only one-step training, by introducing the "bounding-tube" to indicate the temporal-spatial locations of objects in a short video clip.
arXiv Detail & Related papers (2020-06-10T06:45:05Z)
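To make the tube notion concrete, here is a toy sketch of a bounding tube as a spatio-temporal box interpolated across a short clip. The linear parameterisation is an assumption for illustration, not TubeTK's actual design.

```python
# Toy "bounding tube": one object's box over a short clip, linearly
# interpolated between its start and end frames. Illustration only.
from dataclasses import dataclass

@dataclass
class BoundingTube:
    t_start: int
    t_end: int
    box_start: tuple  # (x1, y1, x2, y2) at t_start
    box_end: tuple    # (x1, y1, x2, y2) at t_end

    def box_at(self, t):
        # Linearly interpolate the box for any frame inside the tube.
        a = (t - self.t_start) / max(self.t_end - self.t_start, 1)
        return tuple(s + a * (e - s)
                     for s, e in zip(self.box_start, self.box_end))

tube = BoundingTube(0, 10, (0, 0, 50, 80), (20, 5, 70, 90))
print(tube.box_at(5))  # box midway through the clip
```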