Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames
- URL: http://arxiv.org/abs/2210.08452v1
- Date: Sun, 16 Oct 2022 05:35:00 GMT
- Title: Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames
- Authors: Ning Han, Xun Yang, Ee-Peng Lim, Hao Chen, Qianru Sun
- Abstract summary: Cross-modal video retrieval aims to retrieve semantically relevant videos given a text as a query.
A common and simple solution is to uniformly sample a small number of frames from the video as input to ViT.
To escape this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program.
- Score: 39.03408879727955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal video retrieval aims to retrieve the semantically relevant videos
given a text as a query, and is one of the fundamental tasks in Multimedia.
Most top-performing methods primarily leverage the Vision Transformer (ViT) to
extract video features [1, 2, 3], and thus suffer from the high computational
complexity of ViT, especially when encoding long videos. A common and simple
solution is to
uniformly sample a small number (say, 4 or 8) of frames from the video (instead
of using the whole video) as input to ViT. The number of frames has a strong
influence on the performance of ViT, e.g., using 8 frames performs better than
using 4 frames yet needs more computational resources, resulting in a
trade-off. To escape this trade-off, this paper introduces an automatic
video compression method based on a bilevel optimization program (BOP)
consisting of both model-level (i.e., base-level) and frame-level (i.e.,
meta-level) optimizations. The model-level learns a cross-modal video retrieval
model whose input is the "compressed frames" learned by frame-level
optimization. In turn, the frame-level optimization is through gradient descent
using the meta loss of the video retrieval model computed on the whole video.
We refer to both this BOP method and the resulting "compressed frames" as
Meta-Optimized Frames (MOF). By incorporating MOF, the video retrieval model is able to
utilize the information of whole videos (for training) while taking only a
small number of input frames in actual implementation. The convergence of MOF
is guaranteed by meta gradient descent algorithms. For evaluation, we conduct
extensive experiments of cross-modal video retrieval on three large-scale
benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic
and efficient method for boosting multiple baseline methods, and achieves new
state-of-the-art performance.
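
To make the bilevel program concrete, the following is a minimal, self-contained sketch of one MOF-style training step. The toy linear retrieval model, the symmetric contrastive loss, and all hyperparameters are illustrative assumptions, not the authors' implementation; the point is only the structure: a differentiable base-level model update on the compressed frames, followed by a meta-level gradient step on the frames using a loss computed on the whole video.

```python
# Minimal sketch of one MOF training step (toy model and hyperparameters
# are assumptions; not the authors' code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, K, D, E = 4, 32, 8, 512, 256  # batch, full frames, compressed frames, frame dim, emb dim

full_videos = torch.randn(B, T, D)  # stand-in per-frame features
text_emb = torch.randn(B, E)        # stand-in text embeddings

# Meta-level variables: learnable "compressed frames", initialized by
# uniformly sampling K of the T frames (the simple baseline in the abstract).
comp_frames = full_videos[:, :: T // K, :].clone().requires_grad_(True)

# Base-level model: a single projection, written functionally so the inner
# update stays differentiable for the meta gradient.
W = (0.02 * torch.randn(D, E)).requires_grad_(True)

def encode(frames, weight):
    return frames.mean(dim=1) @ weight  # mean-pool frames, then project

def retrieval_loss(video_emb, txt_emb):
    # Symmetric contrastive (InfoNCE-style) loss, a common choice for retrieval.
    logits = F.normalize(video_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

inner_lr = 1e-2
frame_opt = torch.optim.SGD([comp_frames], lr=1e-2)

for step in range(100):
    # Base level: one differentiable SGD step of the model on the compressed frames.
    base_loss = retrieval_loss(encode(comp_frames, W), text_emb)
    (grad_W,) = torch.autograd.grad(base_loss, (W,), create_graph=True)
    W_fast = W - inner_lr * grad_W

    # Meta level: meta loss of the updated model on the WHOLE video; its
    # gradient flows back through W_fast into the compressed frames.
    meta_loss = retrieval_loss(encode(full_videos, W_fast), text_emb)
    frame_opt.zero_grad()
    meta_loss.backward()
    frame_opt.step()

    # Commit the inner step to the base model and drop stale gradients.
    with torch.no_grad():
        W -= inner_lr * grad_W
    W.grad = None
```

In the full method, the base-level model would be a ViT-based retrieval network trained over mini-batches; the sketch only illustrates how the meta gradient reaches the compressed frames through the inner model update.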
Related papers
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.
We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning.
Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
- Frame-Voyager: Learning to Query Frames for Video Large Language Models [33.84793162102087]
Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks.
Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for variations in information density across videos.
We propose Frame-Voyager, which learns to query informative frame combinations based on the textual queries given in the task.
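For context, here is a minimal sketch of the text-frame retrieval baseline mentioned above, assuming precomputed CLIP-style frame and text embeddings (an assumption; Frame-Voyager itself learns to rank frame combinations rather than scoring frames independently):

```python
# Text-frame retrieval baseline: score frames independently against the
# query and keep the top-k (assumes precomputed CLIP-style embeddings).
import torch
import torch.nn.functional as F

def topk_frames_by_text(frame_emb: torch.Tensor, text_emb: torch.Tensor, k: int) -> torch.Tensor:
    """frame_emb: (T, D) per-frame embeddings; text_emb: (D,). Returns k frame indices."""
    scores = F.normalize(frame_emb, dim=-1) @ F.normalize(text_emb, dim=-1)
    return scores.topk(k).indices.sort().values  # keep temporal order
```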
arXiv Detail & Related papers (2024-10-04T08:26:06Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models [41.12711820047315]
Video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the question.
We propose two frame sampling strategies, namely most dominant frames (MDF) and most implied frames (MIF), to maximally preserve the frames that are most likely vital to the given questions.
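As one illustration of content-aware sampling in this spirit (a generic stand-in, not the paper's exact MDF/MIF criteria), a greedy sampler can avoid redundant frames by maximizing feature diversity:

```python
# Generic diversity-based frame sampler (illustrative; not the paper's
# MDF/MIF algorithms): greedily pick frames least similar to those chosen.
import torch
import torch.nn.functional as F

def diverse_frame_sample(frame_feats: torch.Tensor, k: int) -> torch.Tensor:
    """frame_feats: (T, D) per-frame features. Returns sorted indices of k frames."""
    feats = F.normalize(frame_feats, dim=-1)
    sims = feats @ feats.T
    chosen = [0]  # seed with the first frame
    for _ in range(k - 1):
        redundancy = sims[:, chosen].max(dim=1).values
        redundancy[chosen] = float("inf")  # never re-pick a chosen frame
        chosen.append(int(redundancy.argmin()))
    return torch.tensor(sorted(chosen))
```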
arXiv Detail & Related papers (2023-07-09T14:54:30Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- Neighbor Correspondence Matching for Flow-based Video Frame Synthesis [90.14161060260012]
We introduce a neighbor correspondence matching (NCM) algorithm for flow-based frame synthesis.
NCM is performed in a current-frame-agnostic fashion to establish multi-scale correspondences in the spatial-temporal neighborhoods of each pixel.
The coarse-scale module is designed to leverage neighbor correspondences to capture large motion, while the fine-scale module is more efficient and speeds up the estimation process.
arXiv Detail & Related papers (2022-07-14T09:17:00Z)
- Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
The Multimodal Frame-Scoring Transformer (MFST) framework exploits visual, text, and audio features to score a video with respect to its frames.
The MFST framework first extracts features for each modality (visual, text, and audio) using pretrained encoders.
MFST then trains a multimodal frame-scoring transformer that takes the video-text-audio representations as input and predicts frame-level scores.
arXiv Detail & Related papers (2022-07-05T05:14:15Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from only a limited window of adjacent frames.
We propose a novel Trajectory-Aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)