Semi-supervised 3D Video Information Retrieval with Deep Neural Network
and Bi-directional Dynamic-time Warping Algorithm
- URL: http://arxiv.org/abs/2309.01063v1
- Date: Sun, 3 Sep 2023 03:10:18 GMT
- Title: Semi-supervised 3D Video Information Retrieval with Deep Neural Network
and Bi-directional Dynamic-time Warping Algorithm
- Authors: Yintai Ma, Diego Klabjan
- Abstract summary: The proposed algorithm is designed to handle large video datasets and retrieve the most related videos to a given inquiry video clip.
We split both the candidate and the inquiry videos into a sequence of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network.
We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method.
- Score: 14.39527406033429
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents a novel semi-supervised deep learning algorithm for
retrieving similar 2D and 3D videos based on visual content. The proposed
approach combines the power of deep convolutional and recurrent neural networks
with dynamic time warping as a similarity measure. The proposed algorithm is
designed to handle large video datasets and retrieve the most related videos to
a given inquiry video clip based on its graphical frames and contents. We split
both the candidate and the inquiry videos into a sequence of clips and convert
each clip to a representation vector using an autoencoder-backed deep neural
network. We then calculate a similarity measure between the sequences of
embedding vectors using a bi-directional dynamic time-warping method. This
approach is tested on multiple public datasets, including CC_WEB_VIDEO,
YouTube-8M, S3DIS, and Synthia, and shows good results compared to the
state of the art. The algorithm effectively solves video retrieval tasks and
outperforms the benchmarked state-of-the-art deep learning model.
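The retrieval pipeline described in the abstract (split each video into clips, embed each clip, compare the two embedding sequences with time warping) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `split_into_clips`, `embed_clip`, and `dtw_distance` are hypothetical helpers, the mean-over-frames encoder is only a stand-in for the autoencoder-backed network, and the distance is classic one-directional dynamic time warping with a Euclidean local cost. The paper's bi-directional variant is not specified in this summary and is not reproduced.

```python
import numpy as np


def split_into_clips(frames, clip_len):
    """Split frames (time on the first axis) into consecutive clips.

    Trailing frames that do not fill a whole clip are dropped.
    """
    n_clips = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def embed_clip(clip):
    """Placeholder encoder (assumption): average the frames of a clip.

    The paper uses a learned deep network here; this is only a stand-in.
    """
    return np.asarray(clip, dtype=float).mean(axis=0)


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of embedding vectors."""
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of the first i items of a
    # with the first j items of b.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # a advances, b repeats
                cost[i, j - 1],      # b advances, a repeats
                cost[i - 1, j - 1],  # both advance
            )
    return cost[n, m]
```

A lower `dtw_distance` means the two clip sequences are more similar, so candidates would be ranked by ascending cost. Because warping lets one element align with several, a sequence and a slowed-down copy of it (each embedding repeated) get a cost of zero under this sketch.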
Related papers
- 3D-CSL: self-supervised 3D context similarity learning for
Near-Duplicate Video Retrieval [17.69904571043164]
We introduce 3D-CSL, a compact pipeline for Near-Duplicate Video Retrieval (NDVR).
We propose a two-stage self-supervised similarity learning strategy to optimize the network.
Our method achieves state-of-the-art performance on clip-level NDVR.
arXiv Detail & Related papers (2022-11-10T05:51:08Z)
- Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also improves encoding quality from 34.07 to 34.57 (measured with the PSNR metric).
arXiv Detail & Related papers (2022-10-13T08:15:08Z)
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is only trained on a pair of original and processed videos directly instead of a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
- VideoPose: Estimating 6D object pose from videos [14.210010379733017]
We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos.
Our proposed network takes a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame.
Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art algorithms.
arXiv Detail & Related papers (2021-11-20T20:57:45Z)
- 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video
Recognition [84.697097472401]
We introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network.
We demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets.
arXiv Detail & Related papers (2020-12-29T21:40:38Z)
- Audio-based Near-Duplicate Video Retrieval with Audio Similarity
Learning [19.730467023817123]
We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs.
We train our network following a triplet generation process and optimize the triplet loss function.
The proposed approach achieves very competitive results compared to three state-of-the-art methods.
arXiv Detail & Related papers (2020-10-17T08:12:18Z)
- Self-supervised Video Representation Learning by Uncovering
Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z)
- A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements on the UCF101 action recognition benchmark against state-of-the-art real-time methods: 5.4% higher accuracy and 2 times faster inference, with a model that requires less than 5MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)