A Multi-modal Deep Learning Model for Video Thumbnail Selection
- URL: http://arxiv.org/abs/2101.00073v1
- Date: Thu, 31 Dec 2020 21:10:09 GMT
- Title: A Multi-modal Deep Learning Model for Video Thumbnail Selection
- Authors: Zhifeng Yu, Nanchun Shi
- Abstract summary: A good thumbnail should be a frame that best represents the content of a video while at the same time capturing viewers' attention.
In this paper, we expand the definition of content to include title, description, and audio of a video and utilize information provided by these modalities in our selection model.
To the best of our knowledge, we are the first to propose a multi-modal deep learning model for video thumbnail selection, and it beats the results of the previous state-of-the-art models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A thumbnail is the face of an online video. The explosive growth of videos,
both in number and variety, underpins the importance of a good thumbnail because it
saves potential viewers time when choosing videos and can even entice them to click.
A good thumbnail should be a frame that best represents the content of a
video while at the same time capturing viewers' attention. However, past
techniques and models focus only on frames within a video, and we believe such
a narrow focus leaves out much useful information that is part of
a video. In this paper, we expand the definition of content to include title,
description, and audio of a video and utilize information provided by these
modalities in our selection model. Specifically, our model will first sample
frames uniformly in time and return the top 1,000 frames in this subset with
the highest aesthetic scores, as assigned by a Double-column Convolutional Neural
Network, to avoid the computational burden of processing all frames in
downstream tasks.
Then, the model incorporates frame features extracted from VGG16, text features
from ELECTRA, and audio features from TRILL. These models were selected because
of their results on popular datasets as well as their competitive performance.
After feature extraction, the time-series features (frames and audio) will be
fed into Transformer encoder layers, each returning a vector representing its
corresponding modality. Each of the four features (frames, title, description,
audio) will pass through a context gating layer before concatenation. Finally,
our model will generate a vector in the latent space and select the frame
most similar to this vector. To the best of our knowledge, we are the first to
propose a multi-modal deep learning model for video thumbnail selection, and it
beats the results of the previous state-of-the-art models.
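As a rough sketch of the pre-filtering step above, the snippet below ranks uniformly sampled frames by aesthetic score and keeps the top 1,000. The `aesthetic_model` callable is a hypothetical stand-in for the paper's Double-column CNN, which is not reproduced here.

```python
import numpy as np

def prefilter_frames(frames, aesthetic_model, top_k=1000):
    """Keep the top_k uniformly sampled frames ranked by aesthetic score.

    frames: array of shape (N, H, W, 3), sampled uniformly in time.
    aesthetic_model: any callable returning one score per frame
                     (placeholder for the paper's Double-column CNN).
    """
    scores = np.asarray(aesthetic_model(frames))   # shape (N,)
    keep = np.argsort(scores)[::-1][:top_k]        # indices, highest score first
    return frames[keep], scores[keep]
```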
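The fusion and selection stages can likewise be pictured with a minimal PyTorch sketch, assuming the VGG16, ELECTRA, and TRILL features are extracted upstream; the layer sizes, mean-pooling, and two-layer encoders below are illustrative assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextGating(nn.Module):
    """Element-wise sigmoid gate applied to each modality before concatenation."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return x * torch.sigmoid(self.fc(x))

class ThumbnailSelector(nn.Module):
    """Sketch of the fusion head described in the abstract (sizes are assumptions)."""
    def __init__(self, frame_dim=512, text_dim=256, audio_dim=512, latent_dim=512):
        super().__init__()
        make_encoder = lambda d: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
            num_layers=2)
        self.frame_enc = make_encoder(frame_dim)   # time-series: frame features
        self.audio_enc = make_encoder(audio_dim)   # time-series: audio features
        self.gates = nn.ModuleDict({
            "frames": ContextGating(frame_dim), "title": ContextGating(text_dim),
            "desc": ContextGating(text_dim), "audio": ContextGating(audio_dim)})
        fused_dim = frame_dim + 2 * text_dim + audio_dim
        self.to_latent = nn.Linear(fused_dim, latent_dim)        # video-level vector
        self.frame_to_latent = nn.Linear(frame_dim, latent_dim)  # per-frame vectors

    def forward(self, frame_feats, title_feat, desc_feat, audio_feats):
        # Encode the two time-series modalities, then pool over time.
        frames_vec = self.frame_enc(frame_feats).mean(dim=1)
        audio_vec = self.audio_enc(audio_feats).mean(dim=1)
        # Gate each modality, concatenate, and project into the latent space.
        fused = torch.cat([
            self.gates["frames"](frames_vec), self.gates["title"](title_feat),
            self.gates["desc"](desc_feat), self.gates["audio"](audio_vec)], dim=-1)
        query = self.to_latent(fused)
        candidates = self.frame_to_latent(frame_feats)
        # Pick the candidate frame closest to the video-level vector.
        sims = F.cosine_similarity(candidates, query.unsqueeze(1), dim=-1)
        return sims.argmax(dim=1)
```

A forward pass returns, for each video in the batch, the index of the candidate frame whose latent embedding has the highest cosine similarity with the fused video-level vector.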
Related papers
- Streaming Dense Video Captioning [85.70265343236687]
An ideal model for dense video captioning should be able to handle long input videos and predict rich, detailed textual descriptions.
Current state-of-the-art models process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video.
We propose a streaming dense video captioning model that consists of two novel components.
arXiv Detail & Related papers (2024-04-01T17:59:15Z) - InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z) - Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z) - Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
The Multimodal Frame-Scoring Transformer (MFST) framework exploits visual, text, and audio features and scores a video with respect to its frames.
MFST framework first extracts each modality features (visual-text-audio) using pretrained encoders.
MFST trains the multimodal frame-scoring transformer that uses video-text-audio representations as inputs and predicts frame-level scores.
arXiv Detail & Related papers (2022-07-05T05:14:15Z) - Towards Micro-video Thumbnail Selection via a Multi-label
Visual-semantic Embedding Model [0.0]
The thumbnail, as the first sight of a micro-video, plays a pivotal role in attracting users to click and watch.
We present a multi-label visual-semantic embedding model to estimate the similarity between the pair of each frame and the popular topics that users are interested in.
We fuse the visual representation score and the popularity score of each frame to select the attractive thumbnail for the given micro-video.
arXiv Detail & Related papers (2022-02-07T04:15:26Z) - Leveraging Local Temporal Information for Multimodal Scene
Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)