Encode the Unseen: Predictive Video Hashing for Scalable Mid-Stream
Retrieval
- URL: http://arxiv.org/abs/2009.14661v2
- Date: Fri, 2 Oct 2020 13:11:34 GMT
- Title: Encode the Unseen: Predictive Video Hashing for Scalable Mid-Stream
Retrieval
- Authors: Tong Yu, Nicolas Padoy
- Abstract summary: This paper tackles a new problem in computer vision: mid-stream video-to-video retrieval.
We present the first hashing framework that infers the unseen future content of a currently playing video.
Our approach also yields a significant mAP@20 performance increase compared to a baseline adapted from the literature for this task.
- Score: 12.17757623963458
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper tackles a new problem in computer vision: mid-stream
video-to-video retrieval. This task, which consists in searching a database for
content similar to a video right as it is playing, e.g. from a live stream,
exhibits challenging characteristics. Only the beginning part of the video is
available as query and new frames are constantly added as the video plays out.
To perform retrieval in this demanding situation, we propose an approach based
on a binary encoder that is both predictive and incremental in order to (1)
account for the missing video content at query time and (2) keep up with
repeated, continuously evolving queries throughout the streaming. In
particular, we present the first hashing framework that infers the unseen
future content of a currently playing video. Experiments on FCVID and
ActivityNet demonstrate the feasibility of this task. Our approach also yields
a significant mAP@20 performance increase compared to a baseline adapted from
the literature for this task, for instance 7.4% (2.6%) increase at 20% (50%) of
elapsed runtime on FCVID using bitcodes of size 192 bits.
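To make the mid-stream setting concrete, below is a minimal sketch (Python with NumPy, not the authors' implementation) of fixed-length bitcode retrieval by Hamming distance, an AP@k measure of the kind underlying the mAP@20 figures quoted above, and a query loop that re-encodes the growing video prefix as the stream plays. The `encode_prefix` callback standing in for the paper's predictive, incremental binary encoder is a hypothetical placeholder.
```python
import numpy as np

CODE_BITS = 192  # bitcode length from the FCVID example in the abstract


def hamming_distances(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one binary query code (CODE_BITS,) and every
    database code (N, CODE_BITS); codes hold values in {0, 1}."""
    return np.count_nonzero(db_codes != query_code, axis=1)


def retrieve_top_k(query_code: np.ndarray, db_codes: np.ndarray, k: int = 20) -> np.ndarray:
    """Indices of the k database videos closest to the query in Hamming space."""
    return np.argsort(hamming_distances(query_code, db_codes))[:k]


def average_precision_at_k(retrieved, relevant, k: int = 20) -> float:
    """AP@k for one query (one common convention); mAP@20 averages this over queries."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(retrieved[:k], start=1):
        if idx in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k)


def midstream_retrieval(frame_stream, encode_prefix, db_codes, k: int = 20):
    """Re-issue the query every time new frames arrive from the live stream.

    `encode_prefix` is a hypothetical stand-in for a predictive, incremental
    binary encoder: it maps the frames seen so far to a CODE_BITS-length
    bitcode that anticipates the unseen remainder of the video.
    """
    seen = []
    for frame in frame_stream:
        seen.append(frame)
        query_code = encode_prefix(seen)               # updated bitcode for the partial video
        yield retrieve_top_k(query_code, db_codes, k)  # refreshed top-k ranking
```
Binary codes keep each re-query cheap: comparing a query against the entire database reduces to XOR-and-popcount operations, which is what makes it feasible to refresh the ranking repeatedly while the video is still playing.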
Related papers
- T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval [30.48217069475297]
We introduce a model-based video indexer named T2VIndexer, a sequence-to-sequence generative model that directly generates video identifiers.
T2VIndexer aims to reduce retrieval time while maintaining high accuracy.
arXiv Detail & Related papers (2024-08-21T08:40:45Z)
- EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval [52.375143786641196]
EgoCVR is an evaluation benchmark for fine-grained Composed Video Retrieval.
EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding.
arXiv Detail & Related papers (2024-07-23T17:19:23Z)
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos [51.547065479762715]
We present a methodology tailored for comprehending videos of arbitrary lengths.
We also introduce the TVQA-long benchmark, designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content.
Our results indicate that our models achieve significant improvements in both long- and short-video understanding.
arXiv Detail & Related papers (2024-07-17T15:59:32Z)
- Streaming Dense Video Captioning [85.70265343236687]
An ideal model for dense video captioning should be able to handle long input videos and predict rich, detailed textual descriptions.
Current state-of-the-art models process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video.
We propose a streaming dense video captioning model that consists of two novel components.
arXiv Detail & Related papers (2024-04-01T17:59:15Z)
- Judging a video by its bitstream cover [12.322783570127756]
Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval.
Traditional methods require video decompression to extract pixel-level features like color, texture, and motion.
We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream decoding.
arXiv Detail & Related papers (2023-09-14T00:34:11Z)
- Contrastive Masked Autoencoders for Self-Supervised Video Hashing [54.636976693527636]
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding.
arXiv Detail & Related papers (2022-11-21T06:48:14Z)
- Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module BridgeFormer is trained to answer the "questions" constructed by the text features via resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task in five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- Self-supervised Video Representation Learning by Context and Motion Decoupling [45.510042484456854]
A challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias.
We develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task.
Experiments show that our approach improves the quality of the learned video representation over previous works.
arXiv Detail & Related papers (2021-04-02T02:47:34Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.