Towards Micro-video Thumbnail Selection via a Multi-label
Visual-semantic Embedding Model
- URL: http://arxiv.org/abs/2202.02930v1
- Date: Mon, 7 Feb 2022 04:15:26 GMT
- Title: Towards Micro-video Thumbnail Selection via a Multi-label
Visual-semantic Embedding Model
- Authors: Liu Bo
- Abstract summary: The thumbnail, as the first sight of a micro-video, plays a pivotal role in attracting users to click and watch.
We present a multi-label visual-semantic embedding model to estimate the similarity between each frame and the popular topics that users are interested in.
We fuse the visual representation score and the popularity score of each frame to select the attractive thumbnail for the given micro-video.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The thumbnail, as the first sight of a micro-video, plays a pivotal role in
attracting users to click and watch. In real scenarios, the more a thumbnail
satisfies users' interests, the more likely the micro-video is to be clicked.
In this paper, we aim to select the thumbnail of a given micro-video that meets
most users' interests. Towards this end, we present a multi-label
visual-semantic embedding model to estimate the similarity between each frame
and the popular topics that users are interested in. In this model,
the visual and textual information is embedded into a shared semantic space,
whereby the similarity can be measured directly, even for unseen words.
Moreover, to compare the frame to all words from the popular topics, we devise
an attention embedding space associated with the semantic-attention projection.
With the help of these two embedding spaces, we obtain the popularity score of
a frame, defined as the sum of similarity scores over the corresponding
visual-information and popular-topic pairs. Ultimately, we fuse the
visual representation score and the popularity score of each frame to select
the attractive thumbnail for the given micro-video. Extensive experiments
conducted on a real-world dataset verify that our model
significantly outperforms several state-of-the-art baselines.
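The scoring pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes pre-extracted per-frame visual features and topic word embeddings, uses cosine similarity in the shared semantic space, folds the attention embedding space and semantic-attention projection into a single projection for brevity, and uses a hypothetical fusion weight alpha.
```python
# Illustrative sketch only -- not the authors' code. Assumes pre-extracted
# frame features and topic word embeddings; similarity is cosine similarity
# in a learned shared semantic space.
import torch
import torch.nn.functional as F


class ThumbnailScorer(torch.nn.Module):
    def __init__(self, visual_dim=2048, text_dim=300, embed_dim=512):
        super().__init__()
        # Project visual and textual features into a shared semantic space.
        self.visual_proj = torch.nn.Linear(visual_dim, embed_dim)
        self.text_proj = torch.nn.Linear(text_dim, embed_dim)

    def popularity_scores(self, frame_feats, topic_word_embs):
        """Popularity score of each frame: sum of its similarities to all topic words."""
        v = F.normalize(self.visual_proj(frame_feats), dim=-1)    # (num_frames, embed_dim)
        t = F.normalize(self.text_proj(topic_word_embs), dim=-1)  # (num_words, embed_dim)
        sim = v @ t.t()                                           # (num_frames, num_words)
        return sim.sum(dim=-1)                                    # (num_frames,)

    def select_thumbnail(self, frame_feats, topic_word_embs, visual_rep_scores, alpha=0.5):
        """Fuse the visual representation score with the popularity score, pick the best frame."""
        pop = self.popularity_scores(frame_feats, topic_word_embs)
        fused = alpha * visual_rep_scores + (1.0 - alpha) * pop   # alpha is a hypothetical weight
        return torch.argmax(fused).item()
```
Here visual_rep_scores stands in for whatever per-frame representativeness measure the model produces; in the actual system the two embedding spaces and projections would be learned jointly, which this sketch omits.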
Related papers
- Multi-grained Temporal Prototype Learning for Few-shot Video Object
Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video of the same category as that defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z) - Leveraging Local Temporal Information for Multimodal Scene
Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
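Read literally, this summary suggests a block with one attention path restricted to a local temporal window and one unrestricted global path. The sketch below is only one plausible reading; the window size, the residual fusion, and the dimensions are assumptions, not details from the paper.
```python
# One plausible sketch of a local + global temporal self-attention block;
# window size, residual fusion, and dimensions are assumptions.
import torch


class LocalGlobalTemporalAttention(torch.nn.Module):
    def __init__(self, dim=512, heads=8, window=5):
        super().__init__()
        self.local_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, frames):                       # frames: (batch, T, dim)
        T = frames.size(1)
        idx = torch.arange(T, device=frames.device)
        # Boolean mask: True = blocked. Keep only a +/- (window // 2) neighbourhood.
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window // 2
        local_out, _ = self.local_attn(frames, frames, frames, attn_mask=local_mask)
        global_out, _ = self.global_attn(frames, frames, frames)
        return frames + local_out + global_out       # residual fusion of both paths
```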
arXiv Detail & Related papers (2021-10-26T19:58:32Z) - HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z) - Supervised Video Summarization via Multiple Feature Sets with Parallel
Attention [4.931399476945033]
We suggest a novel model architecture that combines three feature sets for visual content and motion to predict importance scores.
The proposed architecture utilizes an attention mechanism before fusing motion features and features representing the (static) visual content.
Comprehensive experimental evaluations are reported for two well-known datasets, SumMe and TVSum.
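As a rough illustration of the parallel-attention idea in this entry, the sketch below attends over a motion stream and a static visual stream separately and concatenates them before predicting per-frame importance scores; the dimensions, the use of standard multi-head self-attention, and the two-stream simplification (the paper combines three feature sets) are assumptions.
```python
# Rough sketch of per-frame importance scoring with parallel attention over a
# static visual stream and a motion stream; all sizes and the concatenation
# fusion are assumptions (the paper combines three feature sets).
import torch


class ParallelAttentionScorer(torch.nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.static_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.motion_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score_head = torch.nn.Linear(2 * dim, 1)

    def forward(self, static_feats, motion_feats):   # both: (batch, T, dim)
        s, _ = self.static_attn(static_feats, static_feats, static_feats)
        m, _ = self.motion_attn(motion_feats, motion_feats, motion_feats)
        fused = torch.cat([s, m], dim=-1)            # attention first, then fusion
        return self.score_head(fused).squeeze(-1)    # per-frame importance (batch, T)
```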
arXiv Detail & Related papers (2021-04-23T10:46:35Z) - Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z) - Modeling High-order Interactions across Multi-interests for Micro-video
Recommendation [65.16624625748068]
We propose a Self-over-Co Attention module to enhance the user's interest representation.
In particular, we first use co-attention to model correlation patterns across different levels and then use self-attention to model correlation patterns within a specific level.
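One plausible reading of this Self-over-Co Attention pattern is sketched below: co-attention first relates two interest levels, then self-attention models correlations within the cross-informed level. The module choices, dimensions, and two-level simplification are illustrative assumptions rather than the paper's design.
```python
# Loose sketch of a "self-over-co" attention pattern: co-attention across two
# interest levels, then self-attention within the cross-informed level.
# Dimensions and module choices are illustrative assumptions.
import torch


class SelfOverCoAttention(torch.nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.co_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, level_a, level_b):             # e.g. item-level vs. category-level interests
        # Co-attention: level A queries level B to capture cross-level correlations.
        a_cross, _ = self.co_attn(level_a, level_b, level_b)
        # Self-attention: model correlations within the enriched level.
        a_self, _ = self.self_attn(a_cross, a_cross, a_cross)
        return a_self
```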
arXiv Detail & Related papers (2021-04-01T07:20:15Z) - A Multi-modal Deep Learning Model for Video Thumbnail Selection [0.0]
A good thumbnail should be a frame that best represents the content of a video while at the same time capturing viewers' attention.
In this paper, we expand the definition of content to include title, description, and audio of a video and utilize information provided by these modalities in our selection model.
To the best of our knowledge, we are the first to propose a multi-modal deep learning model for video thumbnail selection, and it outperforms previous state-of-the-art models.
arXiv Detail & Related papers (2020-12-31T21:10:09Z) - Comprehensive Information Integration Modeling Framework for Video
Titling [124.11296128308396]
We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
The proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
arXiv Detail & Related papers (2020-06-24T10:38:15Z)