Video alignment using unsupervised learning of local and global features
- URL: http://arxiv.org/abs/2304.06841v2
- Date: Wed, 11 Oct 2023 09:53:51 GMT
- Title: Video alignment using unsupervised learning of local and global features
- Authors: Niloufar Fakhfour, Mohammad ShahverdiKondori, Hoda Mohammadzade
- Abstract summary: We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame using three machine vision tools: person detection, pose estimation, and VGG network.
The resulting time series are used to align videos of the same actions using a novel version of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we tackle the problem of video alignment, the process of
matching the frames of a pair of videos containing similar actions. The main
challenge in video alignment is that accurate correspondence should be
established despite the differences in the execution processes and appearances
between the two videos. We introduce an unsupervised method for alignment that
uses global and local features of the frames. In particular, we introduce
effective features for each video frame using three machine vision tools:
person detection, pose estimation, and VGG network. Then, the features are
processed and combined to construct a multidimensional time series that
represents the video. The resulting time series are used to align videos of the
same actions using a novel version of dynamic time warping named Diagonalized
Dynamic Time Warping (DDTW). The main advantage of our approach is that no
training is required, which makes it applicable to any new type of action
without any need to collect training samples for it. For evaluation, we
considered video synchronization and phase classification tasks on the Penn
Action dataset. Also, for an effective evaluation of the video synchronization
task, we present a new metric called Enclosed Area Error (EAE). The results show
that our method outperforms previous state-of-the-art methods, such as TCC, and
other self-supervised and weakly supervised methods.
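
The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes each video has already been reduced to a multidimensional time series (one feature vector per frame) and uses classical dynamic time warping with Euclidean frame distances. This is not the paper's Diagonalized DTW, whose details are not given here; the function name `dtw_align` and the toy feature values are hypothetical.

```python
import numpy as np


def dtw_align(a, b):
    """Classical DTW between two sequences of per-frame feature vectors.

    a has shape (n, d) and b has shape (m, d). Returns (total_cost, path),
    where path is a list of (i, j) pairs matching frame i of a to frame j of b.
    Illustrative only: this is standard DTW, not the DDTW variant.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # Euclidean distance between every frame of a and every frame of b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # acc[i, j] = minimal accumulated cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the last cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
        k = int(np.argmin(steps))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return acc[n, m], path


# Toy example: the second "video" holds its first pose for one extra frame.
video_a = [[0.0], [1.0], [2.0]]
video_b = [[0.0], [0.0], [1.0], [2.0]]
cost, path = dtw_align(video_a, video_b)
print(cost, path)
```

In the toy example the first frame of `video_a` is matched to the first two frames of `video_b`, which is the kind of frame correspondence a video synchronization task needs.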
Related papers
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- Few-Shot Action Localization without Knowing Boundaries [9.959844922120523]
We show that it is possible to learn to localize actions in untrimmed videos when only one/few trimmed examples of the target action are available at test time.
We propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos.
Our method achieves performance comparable to or better than state-of-the-art fully-supervised, few-shot learning methods.
arXiv Detail & Related papers (2021-06-08T07:32:43Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- Action Graphs: Weakly-supervised Action Localization with Graph Convolution Networks [25.342482374259017]
We present a method for weakly-supervised action localization based on graph convolutions.
Our method utilizes similarity graphs that encode appearance and motion, and pushes the state of the art on THUMOS '14, ActivityNet 1.2, and Charades for weakly supervised action localization.
arXiv Detail & Related papers (2020-02-04T18:21:10Z)