Vision-Language Models Learn Super Images for Efficient Partially
Relevant Video Retrieval
- URL: http://arxiv.org/abs/2312.00414v2
- Date: Tue, 12 Mar 2024 02:39:23 GMT
- Title: Vision-Language Models Learn Super Images for Efficient Partially
Relevant Video Retrieval
- Authors: Taichi Nishimura and Shota Nakada and Masayoshi Kondo
- Abstract summary: We propose an efficient and high-performance method for partially relevant video retrieval.
It aims to retrieve long videos that contain at least one moment relevant to the input text query.
- Score: 2.303098021872002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose an efficient and high-performance method for
partially relevant video retrieval, which aims to retrieve long videos that
contain at least one moment relevant to the input text query. The challenge
lies in encoding dense frames using visual backbones, which requires models to
handle an increased number of frames and results in significant computation costs for
long videos. To mitigate the costs, previous studies use lightweight visual
backbones, yielding sub-optimal retrieval performance due to their limited
capabilities. However, it is undesirable to simply replace the backbones with
high-performance large vision-and-language models (VLMs) due to their low
efficiency. To address this dilemma, instead of dense frames, we focus on super
images, which are created by rearranging the video frames in an $N \times N$
grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and
mitigates the low efficiency of large VLMs. Based on this idea, we make two
contributions. First, we explore whether VLMs generalize to super images in a
zero-shot setting. To this end, we propose a method called query-attentive
super image retrieval (QASIR), which attends to partial moments relevant to the
input query. The zero-shot QASIR yields two discoveries: (1) it enables VLMs to
generalize to super images and (2) the grid size $N$, image resolution, and VLM
size are key trade-off parameters between performance and computation costs.
Second, we introduce fine-tuning QASIR and hybrid QASIR, which combines high- and
low-efficiency models to strike a balance between performance and computation
costs. This reveals two findings: (1) fine-tuning QASIR enables VLMs to
learn super images effectively, and (2) hybrid QASIR minimizes the
performance drop of large VLMs while reducing the computation costs.
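The core construction in the abstract is straightforward to prototype. Below is a minimal sketch, not the authors' implementation: make_super_images packs $N \times N$ sampled frames into a single grid image so that a VLM encodes $\frac{1}{N^2}$ as many inputs, and query_attentive_score pools super-image embeddings with a query-conditioned softmax before cosine matching. The function names, the use of precomputed CLIP-style embeddings, the cell size, and the softmax temperature are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from PIL import Image


def make_super_images(frames, n=4, cell_size=224):
    """Pack sampled video frames into N x N grid "super images".

    frames: list of PIL.Image frames, already uniformly sampled.
    Each super image replaces n*n per-frame encodings with a single
    visual encoding, which is where the 1/N^2 saving comes from.
    """
    super_images = []
    for start in range(0, len(frames), n * n):
        chunk = frames[start:start + n * n]
        canvas = Image.new("RGB", (n * cell_size, n * cell_size))
        for i, frame in enumerate(chunk):
            row, col = divmod(i, n)
            canvas.paste(frame.resize((cell_size, cell_size)),
                         (col * cell_size, row * cell_size))
        super_images.append(canvas)
    return super_images


def query_attentive_score(query_emb, super_embs, temperature=0.07):
    """Query-attentive pooling over super-image embeddings (QASIR-style sketch).

    query_emb:  (d,) text embedding from a VLM text encoder.
    super_embs: (m, d) embeddings of the video's super images.
    Attention conditioned on the query emphasizes the super image covering
    the partially relevant moment; the attended video embedding is then
    compared with the query by cosine similarity.
    """
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(super_embs, dim=-1)
    attn = torch.softmax(v @ q / temperature, dim=0)          # (m,)
    video_emb = F.normalize((attn.unsqueeze(1) * v).sum(0), dim=-1)
    return float(video_emb @ q)
```

Videos can then be ranked by this score for a given text query. In the hybrid setting mentioned in the abstract, one would presumably shortlist candidates with a cheap model and re-score only the shortlist with the large VLM; that pipeline detail is an assumption here, not something specified by the abstract.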
Related papers
- AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction [10.579335027350263]
AdaCM$^2$ is an adaptive cross-modality memory reduction approach to video-text alignment on video streams.
It achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
arXiv Detail & Related papers (2024-11-19T18:04:13Z)
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMMs.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency (a minimal sketch of this style of token pruning appears after this list).
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders
Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models [41.12711820047315]
Video understanding models usually randomly sample a set of frames or clips, regardless of the internal correlations between their visual contents or their relevance to the problem.
We propose two frame sampling strategies, namely the most domain frames (MDF) and most implied frames (MIF), to maximally preserve those frames that are most likely vital to the given questions.
arXiv Detail & Related papers (2023-07-09T14:54:30Z)
- Is a Video worth $n\times n$ Images? A Highly Efficient Approach to
Transformer-based Video Question Answering [14.659023742381777]
Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders, followed by interaction between the frames and the question.
We present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, where we arrange video frames into an $n \times n$ matrix and then convert it into a single image.
arXiv Detail & Related papers (2023-05-16T02:12:57Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video
Classification [63.25852915237032]
This work presents an unsupervised method to retrieve key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework; it has two advantages compared with previous works, one being that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Deep Space-Time Video Upsampling Networks [47.62807427163614]
Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems.
We propose an end-to-end framework for space-time video upsampling that efficiently merges VSR and FI into a joint formulation.
It achieves better results both quantitatively and qualitatively, while reducing runtime (7x faster) and the number of parameters (by 30%) compared to baselines.
arXiv Detail & Related papers (2020-04-06T07:04:21Z)
- Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video
Super-Resolution [95.26202278535543]
A simple solution is to split the task into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR).
However, temporal synthesis and spatial super-resolution are intra-related in this task.
We propose a one-stage space-time video super-resolution framework, which directly synthesizes an HR slow-motion video from an LFR, LR video.
arXiv Detail & Related papers (2020-02-26T16:59:48Z)
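The FastV entry above describes pruning visual tokens in the deep layers of an LVLM, where their attention contribution is small. The sketch below illustrates that general idea and is not the released FastV code: the importance score (attention received from the last text token), the keep ratio, and the function signature are assumptions made for illustration.

```python
import torch


def prune_visual_tokens(hidden_states, attn_weights, visual_slice, keep_ratio=0.5):
    """Drop the least-attended visual tokens after an early decoder layer.

    hidden_states: (seq_len, d) token states output by layer k.
    attn_weights:  (num_heads, seq_len, seq_len) attention maps of layer k.
    visual_slice:  slice marking where the visual tokens sit in the sequence.
    Returns the reduced hidden states and the indices of kept tokens, so
    later layers attend over a much shorter sequence.
    """
    # Attention each visual token receives from the last (text) position,
    # averaged over heads, used here as a cheap importance score.
    received = attn_weights.mean(dim=0)[-1, visual_slice]      # (num_visual,)
    num_keep = max(1, int(keep_ratio * received.numel()))
    keep_local = received.topk(num_keep).indices
    keep_global = keep_local + visual_slice.start

    # Always keep the non-visual tokens (system prompt and text).
    all_idx = torch.arange(hidden_states.size(0))
    non_visual = torch.cat([all_idx[:visual_slice.start], all_idx[visual_slice.stop:]])
    kept = torch.cat([non_visual, keep_global]).sort().values
    return hidden_states[kept], kept
```

The entry's title suggests keeping roughly half of the visual tokens after layer 2; in this sketch, the layer at which the pruning hook is applied and the keep_ratio are left to the caller.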