Show Me What I Like: Detecting User-Specific Video Highlights Using
Content-Based Multi-Head Attention
- URL: http://arxiv.org/abs/2207.08352v2
- Date: Tue, 19 Jul 2022 04:59:56 GMT
- Title: Show Me What I Like: Detecting User-Specific Video Highlights Using
Content-Based Multi-Head Attention
- Authors: Uttaran Bhattacharya and Gang Wu and Stefano Petrangeli and
Viswanathan Swaminathan and Dinesh Manocha
- Abstract summary: We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched.
Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for the objects and the human activities.
- Score: 58.44096082508686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a method to detect individualized highlights for users on given
target videos based on their preferred highlight clips marked on previous
videos they have watched. Our method explicitly leverages the contents of both
the preferred clips and the target videos using pre-trained features for the
objects and the human activities. We design a multi-head attention mechanism to
adaptively weigh the preferred clips based on their object- and
human-activity-based contents, and fuse them using these weights into a single
feature representation for each user. We compute similarities between these
per-user feature representations and the per-frame features computed from the
desired target videos to estimate the user-specific highlight clips from the
target videos. We test our method on a large-scale highlight detection dataset
containing the annotated highlights of individual users. Compared to current
baselines, we observe an absolute improvement of 2-4% in the mean average
precision of the detected highlights. We also perform extensive ablation
experiments on the number of preferred highlight clips associated with each
user as well as on the object- and human-activity-based feature representations
to validate that our method is indeed both content-based and user-specific.
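The two-stage pipeline described in the abstract (attention-based fusion of a user's preferred clips into a single representation, followed by similarity scoring against target-video frames) can be illustrated with a short sketch. The snippet below is a minimal interpretation of that description, not the authors' released implementation: it assumes pre-extracted clip and frame features (e.g., combined object and human-activity embeddings), a learnable query that pools the attended preferred clips into one per-user vector, and cosine similarity for per-frame highlight scoring. The class name UserHighlightScorer, the parameter user_query, the feature dimension, and the number of heads are illustrative assumptions.

```python
# Minimal sketch, assuming pre-extracted per-clip and per-frame features.
# The fusion-by-attention step and cosine-similarity scoring follow the
# abstract's description; all names and dimensions here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UserHighlightScorer(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8):
        super().__init__()
        # Multi-head attention that adaptively weighs a user's preferred clips.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Learnable query used to pool the attended clips into one user vector.
        self.user_query = nn.Parameter(torch.randn(1, 1, feat_dim))

    def forward(self, clip_feats, frame_feats):
        """
        clip_feats:  (B, num_clips, feat_dim)  features of the preferred clips
        frame_feats: (B, num_frames, feat_dim) features of the target-video frames
        returns:     (B, num_frames) per-frame highlight scores for this user
        """
        B = clip_feats.size(0)
        query = self.user_query.expand(B, -1, -1)
        # Fuse the preferred clips into a single per-user representation.
        user_repr, _ = self.attn(query, clip_feats, clip_feats)  # (B, 1, feat_dim)
        # Cosine similarity between the user representation and each frame.
        scores = F.cosine_similarity(
            user_repr.expand_as(frame_feats), frame_feats, dim=-1
        )  # (B, num_frames)
        return scores


# Example usage with random stand-ins for object/activity features.
model = UserHighlightScorer()
clips = torch.randn(2, 5, 512)     # 5 preferred clips per user
frames = torch.randn(2, 300, 512)  # 300 frames in the target video
print(model(clips, frames).shape)  # torch.Size([2, 300])
```

In this sketch, thresholding or ranking the per-frame scores would yield the user-specific highlight clips; how the paper actually aggregates frame scores into clips is not specified in the abstract.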
Related papers
- Personalized Video Summarization by Multimodal Video Understanding [2.1372652192505703]
We present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization.
VSL is based on pre-trained visual language models (VLMs) to avoid the need to train a video summarization system on a large training dataset.
We show that our method is more adaptable across different datasets compared to supervised query-based video summarization models.
arXiv Detail & Related papers (2024-11-05T22:14:35Z)
- Learning User Embeddings from Human Gaze for Personalised Saliency Prediction [12.361829928359136]
We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps.
At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users.
arXiv Detail & Related papers (2024-03-20T14:58:40Z)
- Learning Pixel-Level Distinctions for Video Highlight Detection [39.23271866827123]
We propose to learn pixel-level distinctions to improve the video highlight detection.
This pixel-level distinction indicates whether or not each pixel in one video belongs to an interesting section.
We design an encoder-decoder network to estimate the pixel-level distinction.
arXiv Detail & Related papers (2022-04-10T06:41:16Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- PR-Net: Preference Reasoning for Personalized Video Highlight Detection [34.71807317380797]
We propose a simple yet efficient preference reasoning framework (PR-Net) to explicitly take the diverse interests into account for frame-level highlight prediction.
Our method significantly outperforms state-of-the-art methods with a relative improvement of 12% in mean average precision.
arXiv Detail & Related papers (2021-09-04T06:12:13Z)
- Cross-category Video Highlight Detection via Set-based Learning [55.49267044910344]
We propose a Dual-Learner-based Video Highlight Detection (DL-VHD) framework.
It learns not only the distinction of target-category videos but also the characteristics of highlight moments in the source video category.
It outperforms five typical Unsupervised Domain Adaptation (UDA) algorithms on various cross-category highlight detection tasks.
arXiv Detail & Related papers (2021-08-26T13:06:47Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We learn not only the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)