Vision-Language Models Learn Super Images for Efficient Partially
Relevant Video Retrieval
- URL: http://arxiv.org/abs/2312.00414v2
- Date: Tue, 12 Mar 2024 02:39:23 GMT
- Title: Vision-Language Models Learn Super Images for Efficient Partially
Relevant Video Retrieval
- Authors: Taichi Nishimura and Shota Nakada and Masayoshi Kondo
- Abstract summary: We propose an efficient and high-performance method for partially relevant video retrieval.
It aims to retrieve long videos that contain at least one moment relevant to the input text query.
- Score: 2.303098021872002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose an efficient and high-performance method for
partially relevant video retrieval, which aims to retrieve long videos that
contain at least one moment relevant to the input text query. The challenge
lies in encoding dense frames using visual backbones. This requires models to
handle an increased number of frames, resulting in significant computation costs for
long videos. To mitigate the costs, previous studies use lightweight visual
backbones, yielding sub-optimal retrieval performance due to their limited
capabilities. However, it is undesirable to simply replace the backbones with
high-performance large vision-and-language models (VLMs) due to their low
efficiency. To address this dilemma, instead of dense frames, we focus on super
images, which are created by rearranging the video frames in an $N \times N$
grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and
mitigates the low efficiency of large VLMs. Based on this idea, we make two
contributions. First, we explore whether VLMs generalize to super images in a
zero-shot setting. To this end, we propose a method called query-attentive
super image retrieval (QASIR), which attends to partial moments relevant to the
input query. The zero-shot QASIR yields two discoveries: (1) it enables VLMs to
generalize to super images and (2) the grid size $N$, image resolution, and VLM
size are key trade-off parameters between performance and computation costs.
Second, we introduce fine-tuning QASIR and hybrid QASIR, the latter combining
high- and low-efficiency models to strike a balance between performance and
computation costs. This reveals two findings: (1) fine-tuning QASIR enables
VLMs to learn super images effectively, and (2) hybrid QASIR minimizes the
performance drop of large VLMs while reducing the computation costs.
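To make the grid-layout idea concrete, below is a minimal sketch of how a super image can be tiled from $N \times N$ frames and scored against a text query in a zero-shot fashion. It is not the authors' released code: the Pillow-based tiling, the 224-pixel cell size, the use of CLIP (openai/clip-vit-base-patch32) as a stand-in VLM, and max-pooling over super images in place of QASIR's query attention are all illustrative assumptions.

```python
from typing import List

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def make_super_image(frames: List[Image.Image], n: int, cell: int = 224) -> Image.Image:
    """Tile up to n*n frames into one super image in row-major order.

    Encoding one super image instead of n*n individual frames cuts the number
    of visual encodings to 1/n^2, which is the efficiency argument above.
    """
    canvas = Image.new("RGB", (n * cell, n * cell))
    for idx, frame in enumerate(frames[: n * n]):
        row, col = divmod(idx, n)
        canvas.paste(frame.resize((cell, cell)), (col * cell, row * cell))
    return canvas


# CLIP is only a stand-in for the large VLMs discussed in the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def score_video(query: str, super_images: List[Image.Image]) -> float:
    """Zero-shot sketch: rank a video by the best match between the query and
    any of its super images, so one relevant moment suffices (the 'partially
    relevant' setting). QASIR's query attention is simplified to max-pooling.
    """
    inputs = processor(text=[query], images=super_images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (num_super_images, 1)
    return logits.squeeze(-1).max().item()
```

For example, with $N = 4$ a 160-frame video requires only 10 visual encodings instead of 160, at the cost of each frame occupying 1/16 of the super image's resolution, which reflects the trade-off between grid size, resolution, and performance noted in the abstract.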
Related papers
- FastVLM: Efficient Vision Encoding for Vision Language Models [22.41836943083826]
We introduce FastVLM, a model that achieves an optimized trade-off between latency, model size and accuracy.
FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
arXiv Detail & Related papers (2024-12-17T20:09:55Z)
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone.
The number of vision tokens increases quadratically with the image resolution, leading to huge computational costs.
We propose a greedy search algorithm (G-Search) to find the minimum number of vision tokens to keep at each layer, from the shallow layers to the deep ones (a toy sketch of such a layer-wise search follows this list).
arXiv Detail & Related papers (2024-11-30T18:54:32Z)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models [41.12711820047315]
Video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the problem.
We propose two frame sampling strategies, namely the most domain frames (MDF) and most implied frames (MIF), to maximally preserve those frames that are most likely vital to the given questions.
arXiv Detail & Related papers (2023-07-09T14:54:30Z)
- Is a Video worth $n\times n$ Images? A Highly Efficient Approach to Transformer-based Video Question Answering [14.659023742381777]
Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders, followed by interaction between the frames and the question.
We present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, in which we arrange video frames into an $n \times n$ matrix and then convert it into a single image.
arXiv Detail & Related papers (2023-05-16T02:12:57Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can calculate the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on the top of the CNN to further elevate the performance of classification.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Deep Space-Time Video Upsampling Networks [47.62807427163614]
Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems.
We propose an end-to-end framework for the space-time video upsampling by efficiently merging VSR and FI into a joint framework.
The proposed framework yields better results both quantitatively and qualitatively, while reducing the time (7x faster) and the number of parameters (30%) compared to baselines.
arXiv Detail & Related papers (2020-04-06T07:04:21Z)
- Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution [95.26202278535543]
A simple solution is to split it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR).
However, temporal interpolation and spatial super-resolution are intra-related in this task.
We propose a one-stage space-time video super-resolution framework, which directly synthesizes an HR slow-motion video from an LFR, LR video.
arXiv Detail & Related papers (2020-02-26T16:59:48Z)
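As a complement to the vision-token reduction entry above (see the forward note there), here is a toy sketch of a layer-wise greedy budget search. It is not the published G-Search algorithm; the halving schedule, the `tolerance` margin, and the `evaluate` callback (a stand-in for a validation-quality proxy) are assumptions made for illustration.

```python
from typing import Callable, List


def greedy_layerwise_token_search(
    num_layers: int,
    full_tokens: int,
    evaluate: Callable[[List[int]], float],
    tolerance: float = 0.01,
) -> List[int]:
    """Toy greedy search: walk layers from shallow to deep and repeatedly halve
    each layer's vision-token budget while the quality proxy stays within
    `tolerance` of the full-budget baseline.
    """
    budgets = [full_tokens] * num_layers
    baseline = evaluate(budgets)
    for layer in range(num_layers):  # shallow -> deep
        while budgets[layer] > 1:
            trial = list(budgets)
            trial[layer] = budgets[layer] // 2
            if evaluate(trial) >= baseline - tolerance:
                budgets = trial  # fewer tokens are still good enough; keep halving
            else:
                break  # this layer cannot be reduced further; move to a deeper layer
    return budgets


if __name__ == "__main__":
    # Hypothetical quality proxy that only depends on the two shallowest layers.
    def proxy(budgets: List[int]) -> float:
        return min(1.0, (budgets[0] + budgets[1]) / 1152)

    print(greedy_layerwise_token_search(num_layers=8, full_tokens=576, evaluate=proxy))
```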