KeyVideoLLM: Towards Large-scale Video Keyframe Selection
- URL: http://arxiv.org/abs/2407.03104v1
- Date: Wed, 3 Jul 2024 13:41:44 GMT
- Title: KeyVideoLLM: Towards Large-scale Video Keyframe Selection
- Authors: Hao Liang, Jiapeng Li, Tianyi Bai, Chong Chen, Conghui He, Bin Cui, Wentao Zhang
- Abstract summary: KeyVideoLLM is a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently.
It achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements.
It enhances processing speed by up to 200 times compared to existing selection methods.
- Score: 40.683097609942365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference processes for VideoLLMs demand vast amounts of data, presenting significant challenges to data management, particularly regarding efficiency, robustness, and effectiveness. In this work, we present KeyVideoLLM, a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently, robustly, and effectively. Specifically, KeyVideoLLM achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements, which proves its high efficiency. Additionally, it maintains a 100% selection success rate across all video formats and scales, enhances processing speed by up to 200 times compared to existing keyframe selection methods, and does not require hyperparameter tuning. Beyond its outstanding efficiency and robustness, KeyVideoLLM further improves model performance in video question-answering tasks during both training and inference stages. Notably, it consistently achieved the state-of-the-art (SoTA) experimental results on diverse datasets.
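The core mechanism described here is scoring each candidate frame against the text of the question or caption and keeping the best-matching frames. Below is a minimal sketch of that idea using a CLIP encoder from Hugging Face `transformers`; the choice of CLIP and the plain top-k rule are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch of text-video frame similarity-based keyframe selection.
# CLIP as the text-image encoder and simple top-k selection are assumptions
# for illustration; KeyVideoLLM's actual components may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_keyframes(frames: list[Image.Image], query: str, k: int = 8) -> list[int]:
    """Return indices of the k frames most similar to the text query."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)  # (num_frames,) similarity to the query
    return torch.topk(scores, min(k, len(frames))).indices.tolist()
```

Storing only the selected frames is what yields the reported disk savings: keeping roughly one frame in every sixty decoded frames corresponds to a compression rate on the order of 60x.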
Related papers
- OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer [14.503628667535425]
Processing extensive videos presents significant challenges due to vast data and processing demands.
We develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries.
It features a Divide-and-Conquer Loop capable of autonomous reasoning.
We have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
arXiv Detail & Related papers (2024-06-24T13:05:39Z) - TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing the aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
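The toy sketch below illustrates one way a per-frame loss could be reweighted toward frames where the target deforms strongly, using mask IoU between consecutive ground-truth masks as the deformation proxy; both the proxy and the weighting rule are assumptions for illustration, not the loss defined in the paper.

```python
# Illustrative "transformation-aware" reweighting of a per-frame loss.
# Assumption: deformation is approximated by 1 - IoU of consecutive GT masks.
import torch

def transformation_aware_loss(per_frame_loss: torch.Tensor,
                              gt_masks: torch.Tensor) -> torch.Tensor:
    """per_frame_loss: (T,); gt_masks: (T, H, W) binary ground-truth masks."""
    prev, curr = gt_masks[:-1].bool(), gt_masks[1:].bool()
    inter = (prev & curr).flatten(1).sum(-1).float()
    union = (prev | curr).flatten(1).sum(-1).float().clamp(min=1)
    deformation = 1.0 - inter / union          # (T-1,) large when the mask changes a lot
    weights = torch.ones_like(per_frame_loss)
    weights[1:] += deformation                 # emphasize frames after large deformations
    return (weights * per_frame_loss).sum() / weights.sum()
```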
arXiv Detail & Related papers (2023-12-13T21:02:03Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders
Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Characterizing Video Question Answering with Sparsified Inputs [55.7455981156755]
We characterize video question answering under different levels of input sparsity and provide a tool for doing so.
Specifically, we use a Gumbel-based learnable selection module to adaptively select the best inputs for the final task.
In our experiments, we observed only a 5.2%-5.8% loss in performance when using just 10% of the video length.
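A minimal sketch of such a Gumbel-softmax selection module is shown below; the per-frame linear scorer, the straight-through sampling, and the shapes are illustrative assumptions rather than the paper's exact module.

```python
# Sketch of a learnable frame selector using straight-through Gumbel-softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelFrameSelector(nn.Module):
    def __init__(self, feat_dim: int, k: int, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)   # per-frame relevance score (assumption)
        self.k, self.tau = k, tau

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D); returns (B, k, D) with k frames picked differentiably.
        logits = self.scorer(frame_feats).squeeze(-1)                   # (B, T)
        picks = []
        for _ in range(self.k):
            onehot = F.gumbel_softmax(logits, tau=self.tau, hard=True)  # hard sample, soft gradient
            picks.append(torch.einsum("bt,btd->bd", onehot, frame_feats))
            logits = logits.masked_fill(onehot.bool(), float("-inf"))   # avoid re-picking a frame
        return torch.stack(picks, dim=1)
```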
arXiv Detail & Related papers (2023-11-27T21:00:20Z) - Deep Unsupervised Key Frame Extraction for Efficient Video
Classification [63.25852915237032]
This work presents an unsupervised method to retrieve key frames, which combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
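The density-peaks idea can be sketched in a few lines over per-frame CNN features: frames that are both locally dense and far from any denser frame become cluster centers, i.e. key frames, so the number of key frames emerges from the data. The Gaussian density, the cutoff quantile, and the outlier threshold below are illustrative assumptions, not the paper's exact settings.

```python
# Toy density-peaks key frame selection over per-frame features (NumPy sketch).
import numpy as np

def density_peak_keyframes(feats: np.ndarray, dc_quantile: float = 0.02) -> np.ndarray:
    """feats: (T, D) per-frame features. Returns indices of selected key frames."""
    dist = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)   # (T, T) pairwise distances
    dc = np.quantile(dist[dist > 0], dc_quantile)                     # distance cutoff
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0                 # local density
    delta = np.empty(len(feats))
    for i in range(len(feats)):
        higher = np.where(rho > rho[i])[0]                            # frames with higher density
        delta[i] = dist.max() if higher.size == 0 else dist[i, higher].min()
    gamma = rho * delta
    # outliers in gamma are cluster centers, i.e. key frames, so the number of
    # key frames is determined automatically rather than fixed in advance
    return np.where(gamma > gamma.mean() + 2 * gamma.std())[0]
```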
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - VideoMAE: Masked Autoencoders are Data-Efficient Learners for
Self-Supervised Video Pre-Training [49.68815656405452]
We show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP).
Inspired by the recent ImageMAE, we propose customized video tube masking and reconstruction.
Our VideoMAE with the vanilla ViT backbone achieves 83.9% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51 without using any extra data.
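Tube masking hides the same spatial patch positions in every frame, so a masked patch cannot be recovered by copying it from a neighbouring frame; this is what makes very high masking ratios workable. The patch-grid arguments and the 90% default in the sketch below are illustrative assumptions, not the exact configuration.

```python
# Sketch of tube masking: one spatial mask shared across the temporal axis.
import torch

def tube_mask(num_frames: int, h_patches: int, w_patches: int,
              mask_ratio: float = 0.9) -> torch.Tensor:
    """Boolean mask of shape (num_frames, h_patches * w_patches); True means masked."""
    num_patches = h_patches * w_patches
    masked = torch.randperm(num_patches)[: int(round(mask_ratio * num_patches))]
    frame_mask = torch.zeros(num_patches, dtype=torch.bool)
    frame_mask[masked] = True
    # repeat the same spatial mask for every frame, forming temporal "tubes"
    return frame_mask.unsqueeze(0).expand(num_frames, -1).clone()
```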
arXiv Detail & Related papers (2022-03-23T17:55:10Z) - Self-supervised Video Retrieval Transformer Network [10.456881328982586]
We propose SVRTN, which applies self-supervised training to learn video representations from unlabeled data.
It exploits a transformer structure to aggregate frame-level features into clip-level features, reducing both storage space and search complexity.
It learns complementary and discriminative information from interactions among clip frames, and acquires invariance to frame permutation and missing frames to support more flexible retrieval.
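A bare-bones version of aggregating frame-level features into a single clip-level embedding with a transformer encoder is sketched below; the layer sizes and mean-pooling readout are assumptions for illustration.

```python
# Sketch: transformer encoder over frame features, pooled to one clip embedding.
import torch
import torch.nn as nn

class ClipAggregator(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) frame-level features -> (B, D) clip-level embedding
        return self.encoder(frame_feats).mean(dim=1)
```

Producing one vector per short clip rather than matching individual frames is what reduces storage and search cost.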
arXiv Detail & Related papers (2021-04-16T09:43:45Z) - Temporal Context Aggregation for Video Retrieval with Contrastive
Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information among frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
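For context, a typical contrastive (InfoNCE-style) objective over such video-level embeddings looks like the sketch below, pulling each clip toward its positive pair and pushing it away from other clips in the batch; the temperature and in-batch negatives are generic assumptions, not TCA's exact formulation.

```python
# Generic InfoNCE loss over L2-normalized video-level embeddings.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: (B, D) embeddings; row i of each is a positive pair."""
    logits = anchor @ positive.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```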
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.