VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression
- URL: http://arxiv.org/abs/2303.08906v2
- Date: Tue, 19 Dec 2023 09:16:31 GMT
- Title: VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression
- Authors: Won Jo, Geuntaek Lim, Gwangjin Lee, Hyunwoo Kim, Byungsoo Ko, and
Yukyung Choi
- Abstract summary: We show that appropriate suppression of irrelevant frames can provide insight into the current obstacles of the video-level approaches.
We propose a Video-to-Video Suppression network (VVS) as a solution.
VVS is an end-to-end framework that consists of an easy distractor elimination stage to identify which frames to remove and a suppression weight generation stage to determine the extent to which the remaining frames should be suppressed.
- Score: 12.793922882841137
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In content-based video retrieval (CBVR), dealing with large-scale
collections, efficiency is as important as accuracy; thus, several video-level
feature-based studies have actively been conducted. Nevertheless, owing to the
severe difficulty of embedding a lengthy and untrimmed video into a single
feature, these studies have been insufficient for accurate retrieval compared
to frame-level feature-based studies. In this paper, we show that appropriate
suppression of irrelevant frames can provide insight into the current obstacles
of the video-level approaches. Furthermore, we propose a Video-to-Video
Suppression network (VVS) as a solution. VVS is an end-to-end framework that
consists of an easy distractor elimination stage to identify which frames to
remove and a suppression weight generation stage to determine the extent to which the
remaining frames should be suppressed. This structure is intended to effectively
describe an untrimmed video with varying content and meaningless information.
Its efficacy is proved via extensive experiments, and we show that our approach
is not only state-of-the-art in video-level approaches but also has a fast
inference time despite possessing retrieval capabilities close to those of
frame-level approaches. Code is available at https://github.com/sejong-rcv/VVS
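As a rough illustration of the two-stage idea, the sketch below (PyTorch-style; not the released VVS code, with all module names and the threshold being hypothetical) hard-drops low-scoring distractor frames and then softly weights the remaining ones before pooling them into a single video-level descriptor.

```python
# Minimal sketch (not the released VVS code): drop low-scoring "easy distractor"
# frames, softly weight the rest, and pool into one video-level feature.
import torch
import torch.nn as nn

class SuppressionPooling(nn.Module):
    """Hypothetical two-stage pooling: hard frame elimination + soft weighting."""

    def __init__(self, dim: int = 512, drop_threshold: float = 0.2):
        super().__init__()
        self.drop_threshold = drop_threshold          # stage 1: eliminate easy distractors
        self.score = nn.Sequential(                   # stage 2: per-frame suppression weights
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, D) frame-level features of one untrimmed video
        saliency = torch.sigmoid(self.score(frames)).squeeze(-1)  # (T,)
        keep = saliency > self.drop_threshold                     # which frames to keep
        if keep.sum() == 0:                                       # guard: keep all if everything was dropped
            keep = torch.ones_like(keep, dtype=torch.bool)
        kept, w = frames[keep], saliency[keep]
        w = w / w.sum()                                            # how much to suppress each remaining frame
        video_feat = (w.unsqueeze(-1) * kept).sum(dim=0)           # single video-level descriptor (D,)
        return nn.functional.normalize(video_feat, dim=0)

video = torch.randn(300, 512)                 # e.g., 300 frame-level features
descriptor = SuppressionPooling()(video)      # one L2-normalised vector per video
print(descriptor.shape)                       # torch.Size([512])
```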
Related papers
- Video Dynamics Prior: An Internal Learning Approach for Robust Video
Enhancements [83.5820690348833]
We present a framework for low-level vision tasks that does not require any external training data corpus.
Our approach learns neural modules by optimizing over the corrupted sequence, leveraging its spatio-temporal coherence and internal statistics.
arXiv Detail & Related papers (2023-12-13T01:57:11Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
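A rough sketch of ranking-guided rather than uniform frame selection is given below; the function name and the assumption that per-frame CLIP scores are already computed are illustrative, not VaQuitA's actual implementation.

```python
# Illustrative only: pick the top-k frames by a precomputed CLIP similarity score
# instead of sampling uniformly; `clip_scores` is assumed to be given.
import torch

def sample_frames_by_clip_score(frames: torch.Tensor,
                                clip_scores: torch.Tensor,
                                k: int = 8) -> torch.Tensor:
    """frames: (T, C, H, W) decoded frames; clip_scores: (T,) frame-text similarities."""
    k = min(k, frames.shape[0])
    top = torch.topk(clip_scores, k).indices          # ranking-guided selection
    return frames[top.sort().values]                  # keep temporal order of chosen frames

frames = torch.randn(64, 3, 224, 224)                 # 64 decoded frames
scores = torch.rand(64)                               # stand-in for CLIP-score rankings
subset = sample_frames_by_clip_score(frames, scores)  # (8, 3, 224, 224)
```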
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Video Event Restoration Based on Keyframes for Video Anomaly Detection [9.18057851239942]
Existing deep neural network based video anomaly detection (VAD) methods mostly follow the route of frame reconstruction or frame prediction.
We introduce a brand-new VAD paradigm to break through these limitations.
We propose a novel U-shaped Swin Transformer Network with Dual Skip Connections (USTN-DSC) for video event restoration.
arXiv Detail & Related papers (2023-04-11T10:13:19Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from only a limited set of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement
Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z)
- Memory-Augmented Non-Local Attention for Video Super-Resolution [61.55700315062226]
We propose a novel video super-resolution method that aims at generating high-fidelity high-resolution (HR) videos from low-resolution (LR) ones.
Previous methods predominantly leverage temporal neighbor frames to assist the super-resolution of the current frame.
In contrast, we devise a cross-frame non-local attention mechanism that allows video super-resolution without frame alignment.
arXiv Detail & Related papers (2021-08-25T05:12:14Z)
- Self-Conditioned Probabilistic Learning of Video Rescaling [70.10092286301997]
We propose a self-conditioned probabilistic framework for video rescaling to learn the paired downscaling and upscaling procedures simultaneously.
We decrease the entropy of the information lost in the downscaling by maximizing its probability conditioned on the strong spatial-temporal prior information.
We extend the framework to a lossy video compression system, in which a gradient estimator for non-differential industrial lossy codecs is proposed.
arXiv Detail & Related papers (2021-07-24T15:57:15Z)
- Self-supervised Video Retrieval Transformer Network [10.456881328982586]
We propose SVRTN, which applies self-supervised training to learn video representation from unlabeled data.
It exploits transformer structure to aggregate frame-level features into clip-level to reduce both storage space and search complexity.
It can learn complementary and discriminative information from the interactions among clip frames, as well as acquire invariance to frame permutation and missing frames to support more flexible retrieval.
arXiv Detail & Related papers (2021-04-16T09:43:45Z)
- A Sparse Sampling-based framework for Semantic Fast-Forward of
First-Person Videos [2.362412515574206]
Most uploaded videos are doomed to be forgotten and unwatched, stashed away in some computer folder or website.
We present a new adaptive frame selection formulated as a weighted minimum reconstruction problem.
Our method is able to retain as much relevant information and smoothness as the state-of-the-art techniques, but in less processing time.
arXiv Detail & Related papers (2020-09-21T18:36:17Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive
Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
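The sketch below gives a generic picture of this kind of temporal aggregation with contrastive training: a self-attention layer mixes frame-level features over long ranges, the result is pooled into one video-level vector, and matching video pairs are pulled together with an InfoNCE-style loss. Layer sizes, the pooling, and the loss are illustrative assumptions, not TCA's exact configuration.

```python
# Generic sketch in the spirit of TCA (illustrative, not the paper's code):
# self-attention over frame-level features, pooled to a video-level vector,
# trained with an InfoNCE-style contrastive loss between matching videos.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAggregator(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame-level features -> (B, D) video-level features
        ctx = self.encoder(frames)                       # long-range temporal interactions
        return F.normalize(ctx.mean(dim=1), dim=-1)      # mean-pool and L2-normalise

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # Anchor i should match positive i; other videos in the batch act as negatives.
    logits = anchor @ positive.t() / tau                 # (B, B) similarity matrix
    labels = torch.arange(anchor.shape[0])
    return F.cross_entropy(logits, labels)

agg = TemporalAggregator()
query, target = torch.randn(4, 120, 512), torch.randn(4, 120, 512)
loss = info_nce(agg(query), agg(target))
print(float(loss))
```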