Unboxing Engagement in YouTube Influencer Videos: An Attention-Based Approach
- URL: http://arxiv.org/abs/2012.12311v6
- Date: Sun, 11 May 2025 16:59:53 GMT
- Title: Unboxing Engagement in YouTube Influencer Videos: An Attention-Based Approach
- Authors: Prashant Rajaram, Puneet Manchanda
- Abstract summary: "What is said" through words (text) is more important than "how it is said" through imagery (video images) or acoustics (audio) in predicting video engagement.<n>We analyze unstructured data from long-form YouTube influencer videos.
- Score: 0.3686808512438362
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Influencer marketing has become a widely used strategy for reaching customers. Despite growing interest among influencers and brand partners in predicting engagement with influencer videos, there has been little research on the relative importance of different video data modalities in predicting engagement. We analyze unstructured data from long-form YouTube influencer videos - spanning text, audio, and video images - using an interpretable deep learning framework that leverages model attention to video elements. This framework enables strong out-of-sample prediction, followed by ex-post interpretation using a novel approach that prunes spurious associations. Our prediction-based results reveal that "what is said" through words (text) is more important than "how it is said" through imagery (video images) or acoustics (audio) in predicting video engagement. Interpretation-based findings show that during the critical onset period of a video (first 30 seconds), auditory stimuli (e.g., brand mentions and music) are associated with sentiment expressed in verbal engagement (comments), while visual stimuli (e.g., video images of humans and packaged goods) are linked with sentiment expressed through non-verbal engagement (the thumbs-up/down ratio). We validate our approach through multiple methods, connect our findings to relevant theory, and discuss implications for influencers, brands and agencies.
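A minimal sketch of the kind of modality-level attention the abstract describes, for intuition only: per-second feature sequences for text, audio, and video images are pooled with attention over time, then weighted with attention over modalities, yielding both an engagement prediction and interpretable modality weights. The module name `ModalityAttention`, the dimensions, and the random inputs are assumptions, not the authors' implementation.
```python
# Illustrative sketch only: modality-level attention for engagement prediction.
# Feature extractors, dimensions, and names are assumptions, not the paper's model.
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.time_attn = nn.Linear(dim, 1)   # attention over time steps
        self.mod_attn = nn.Linear(dim, 1)    # attention over modalities
        self.head = nn.Linear(dim, 1)        # engagement score

    def pool(self, x):
        # x: (batch, time, dim) -> attention-weighted sum over time
        w = torch.softmax(self.time_attn(x), dim=1)
        return (w * x).sum(dim=1)

    def forward(self, text, audio, image):
        # each input: (batch, time, dim) feature sequence for one modality
        summaries = torch.stack([self.pool(m) for m in (text, audio, image)], dim=1)
        w = torch.softmax(self.mod_attn(summaries), dim=1)   # modality weights
        fused = (w * summaries).sum(dim=1)
        return self.head(fused), w.squeeze(-1)               # prediction + weights

# toy usage: 2 videos, 30 one-second steps, 128-d features per modality
model = ModalityAttention()
t, a, v = (torch.randn(2, 30, 128) for _ in range(3))
score, modality_weights = model(t, a, v)
print(score.shape, modality_weights.shape)  # torch.Size([2, 1]) torch.Size([2, 3])
```
The learned modality weights give a rough analogue of the "what is said" versus "how it is said" comparison, since they can be inspected after training.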
Related papers
- DreamRelation: Relation-Centric Video Customization [33.65405972817795]
Video customization refers to the creation of personalized videos that depict user-specified relations between two subjects.
While existing methods can personalize subject appearances and motions, they still struggle with complex video customization.
We propose DreamRelation, a novel approach that learns the target relation from a small set of videos, leveraging two key components: Decoupling Learning and Dynamics Enhancement.
arXiv Detail & Related papers (2025-03-10T17:58:03Z) - Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence.
Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z) - Admitting Ignorance Helps the Video Question Answering Models to Answer [82.22149677979189]
We argue that models often establish shortcuts, resulting in spurious correlations between questions and answers.
We propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question.
In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness.
arXiv Detail & Related papers (2025-01-15T12:44:52Z) - Enhancing Multi-Modal Video Sentiment Classification Through Semi-Supervised Clustering [0.0]
We aim to improve video sentiment classification by focusing on three key aspects: the video itself, the accompanying text, and the acoustic features.
We are developing a method that utilizes clustering-based semi-supervised pre-training to extract meaningful representations from the data.
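As a toy illustration of clustering-based pseudo-labelling for semi-supervised pre-training of the kind this entry describes, the sketch below clusters unlabeled feature vectors and pre-trains a classifier on the cluster ids. The feature shapes, the use of k-means, and the logistic-regression stand-in are assumptions, not the paper's method.
```python
# Illustrative sketch: pseudo-labels from clustering, then pre-training on them.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

unlabeled_feats = np.random.randn(500, 64)   # e.g. fused video/text/audio features
pseudo_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(unlabeled_feats)

# Pre-train on pseudo-labels; the representation/classifier could then be
# fine-tuned on the small labeled sentiment set.
pretrained = LogisticRegression(max_iter=1000).fit(unlabeled_feats, pseudo_labels)
print(pretrained.score(unlabeled_feats, pseudo_labels))
```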
arXiv Detail & Related papers (2025-01-11T08:04:39Z) - Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - Compositional Video Generation as Flow Equalization [72.88137795439407]
Large-scale Text-to-Video (T2V) diffusion models have recently demonstrated unprecedented capability to transform natural language descriptions into stunning and photorealistic videos.
Despite the promising results, these models struggle to fully grasp complex compositional interactions between multiple concepts and actions.
We introduce Vico, a generic framework for compositional video generation that explicitly ensures all concepts are represented properly.
arXiv Detail & Related papers (2024-06-10T16:27:47Z) - Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue [63.32199372362483]
We propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE.
In particular, we first propose a lexicon-guided utterance sentiment inference module, where an utterance sentiment refinement strategy is devised.
We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip.
arXiv Detail & Related papers (2024-02-06T03:14:46Z) - Micro-video Tagging via Jointly Modeling Social Influence and Tag Relation [56.23157334014773]
85.7% of micro-videos lack annotation.
Existing methods mostly focus on analyzing video content, neglecting users' social influence and tag relation.
We formulate micro-video tagging as a link prediction problem in a constructed heterogeneous network.
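As a toy illustration of the link-prediction formulation above, candidate (video, tag) links can be scored with embedding dot products over nodes of the heterogeneous graph. The random embeddings and the dot-product scorer below are assumptions, not the paper's model.
```python
# Illustrative sketch: score (video, tag) links with node-embedding dot products.
# Embeddings would normally come from a trained graph model; here they are random.
import numpy as np

rng = np.random.default_rng(0)
video_emb = rng.normal(size=(100, 32))   # video nodes
tag_emb = rng.normal(size=(50, 32))      # tag nodes

def rank_tags(video_id, k=5):
    scores = tag_emb @ video_emb[video_id]   # link scores for all candidate tags
    return np.argsort(-scores)[:k]           # top-k predicted tags

print(rank_tags(7))
```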
arXiv Detail & Related papers (2023-03-15T02:13:34Z) - How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z) - Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses [0.0]
We present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis.
The data were collected by presenting music videos to participants in three conditions: music, visual, and audiovisual.
arXiv Detail & Related papers (2022-02-19T07:36:43Z) - MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
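A minimal sketch of the masked-snippet objective this entry describes: the representation at the MASK position is scored against candidate snippets, and the model is trained to pick the correct one. The tensors `mask_repr` and `candidates`, the shapes, and the temperature are placeholders, not the MERLOT Reserve implementation.
```python
# Illustrative sketch: contrastive selection of the masked-out snippet.
import torch
import torch.nn.functional as F

batch, n_candidates, dim = 4, 8, 256
mask_repr = torch.randn(batch, dim)                  # model output at the MASK position
candidates = torch.randn(batch, n_candidates, dim)   # encoded candidate text/audio snippets
target = torch.randint(0, n_candidates, (batch,))    # index of the true snippet

logits = torch.einsum("bd,bnd->bn",
                      F.normalize(mask_repr, dim=-1),
                      F.normalize(candidates, dim=-1))   # cosine similarities
loss = F.cross_entropy(logits / 0.05, target)            # temperature-scaled selection loss
print(loss.item())
```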
arXiv Detail & Related papers (2022-01-07T19:00:21Z) - Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
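A rough sketch of a co-attention block that lets low-level and high-level feature maps gate each other before fusion; the layer names, channel counts, and sigmoid gating are assumptions, not the paper's architecture.
```python
# Illustrative sketch: co-attention between low-level and high-level features.
# Each stream is re-weighted by an attention map computed from the other stream.
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.attn_from_high = nn.Conv2d(channels, 1, kernel_size=1)
        self.attn_from_low = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, low, high):
        # low, high: (batch, channels, H, W) feature maps at the same resolution
        low_w = torch.sigmoid(self.attn_from_high(high))   # gate low-level by high-level
        high_w = torch.sigmoid(self.attn_from_low(low))    # gate high-level by low-level
        return torch.cat([low * low_w, high * high_w], dim=1)

fused = CoAttention()(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
print(fused.shape)  # torch.Size([1, 128, 56, 56])
```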
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - E-ffective: A Visual Analytic System for Exploring the Emotion and Effectiveness of Inspirational Speeches [57.279044079196105]
E-ffective is a visual analytic system allowing speaking experts and novices to analyze both the role of speech factors and their contribution to effective speeches.
Two novel visualizations include E-spiral (that shows the emotional shifts in speeches in a visually compact way) and E-script (that connects speech content with key speech delivery information).
arXiv Detail & Related papers (2021-10-28T06:14:27Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Identity-aware Graph Memory Network for Action Detection [37.65846189707054]
We explicitly highlight the identity information of the actors in terms of both long-term and short-term context through a graph memory network.
Specifically, we propose the hierarchical graph neural network (IGNN) to comprehensively conduct long-term relation modeling.
We develop a dual attention module (DAM) to generate identity-aware constraint to reduce the influence of interference by the actors of different identities.
arXiv Detail & Related papers (2021-08-26T02:34:55Z) - Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are embedded by pre-trained models firstly to obtain the visual and textual features.
We consider the temporal, spatial, and semantic relations, and fuse the multimodal features by hierarchical attention mechanism to predict the answer.
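A simplified sketch of hierarchical attention fusion for video question answering: attend over object-level and frame-level visual features conditioned on the question, then fuse the attended summaries with the question feature to score answers. The module names, dimensions, and two-level hierarchy are assumptions, not the RHA implementation.
```python
# Illustrative sketch: question-guided attention at two levels, then fusion.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, dim=256, n_answers=1000):
        super().__init__()
        self.obj_attn = nn.Linear(2 * dim, 1)     # object-level attention
        self.frame_attn = nn.Linear(2 * dim, 1)   # frame-level attention
        self.classifier = nn.Linear(2 * dim, n_answers)

    def attend(self, feats, query, layer):
        # feats: (batch, n, dim); query: (batch, dim)
        q = query.unsqueeze(1).expand(-1, feats.size(1), -1)
        w = torch.softmax(layer(torch.cat([feats, q], dim=-1)), dim=1)
        return (w * feats).sum(dim=1)

    def forward(self, obj_feats, frame_feats, question):
        objs = self.attend(obj_feats, question, self.obj_attn)        # spatial level
        frames = self.attend(frame_feats, question, self.frame_attn)  # temporal level
        fused = torch.cat([objs + frames, question], dim=-1)
        return self.classifier(fused)

model = HierarchicalFusion()
logits = model(torch.randn(2, 20, 256), torch.randn(2, 16, 256), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 1000])
```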
arXiv Detail & Related papers (2021-05-13T09:35:42Z) - The Influence of Audio on Video Memorability with an Audio Gestalt Regulated Video Memorability System [1.8506048493564673]
We find evidence to suggest that audio can facilitate overall video recognition memorability for videos rich in high-level (gestalt) audio features.
We introduce a novel multimodal deep learning-based late-fusion system that uses audio gestalt to estimate the influence of a given video's audio on its overall short-term recognition memorability.
We benchmark our audio gestalt based system on the Memento10k short-term video memorability dataset, achieving top-2 state-of-the-art results.
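A toy sketch of late fusion regulated by an audio-gestalt score, where the audio score decides how much the audio-based prediction contributes to the final memorability estimate. The weighting rule and all numbers are made up for illustration and are not the paper's system.
```python
# Illustrative sketch: gestalt-regulated late fusion of per-modality predictions.
import numpy as np

def fuse(visual_pred, audio_pred, audio_gestalt):
    # audio_gestalt in [0, 1]: higher means richer high-level audio features,
    # so the audio-based prediction is trusted more.
    w = np.clip(audio_gestalt, 0.0, 1.0)
    return (1.0 - w) * visual_pred + w * audio_pred

print(fuse(visual_pred=0.72, audio_pred=0.81, audio_gestalt=0.6))  # 0.774
```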
arXiv Detail & Related papers (2021-04-23T12:53:33Z) - Modeling High-order Interactions across Multi-interests for Micro-video Recommendation [65.16624625748068]
We propose a Self-over-Co Attention module to enhance user's interest representation.
In particular, we first use co-attention to model correlation patterns across different levels and then use self-attention to model correlation patterns within a specific level.
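A compact sketch of the "co-attention across levels, then self-attention within a level" pattern described above, using standard multi-head attention as a stand-in. The interest levels, dimensions, and module choices are assumptions, not the paper's Self-over-Co Attention module.
```python
# Illustrative sketch: co-attention between two interest levels, followed by
# self-attention within the cross-refined level.
import torch
import torch.nn as nn

dim, n_items = 64, 10
co_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

level_a = torch.randn(2, n_items, dim)   # e.g. item-level interests
level_b = torch.randn(2, n_items, dim)   # e.g. category-level interests

# Co-attention: level A queries level B to capture cross-level correlations.
cross, _ = co_attn(query=level_a, key=level_b, value=level_b)
# Self-attention: model correlations within the refined representation.
refined, _ = self_attn(query=cross, key=cross, value=cross)
print(refined.shape)  # torch.Size([2, 10, 64])
```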
arXiv Detail & Related papers (2021-04-01T07:20:15Z) - Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models [103.64435911083432]
We present a novel contrastive learning strategy called Proactive Pseudo-Intervention (PPI).
PPI leverages proactive interventions to guard against image features with no causal relevance.
We also devise a novel causally informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability.
arXiv Detail & Related papers (2020-12-06T20:30:26Z) - How-to Present News on Social Media: A Causal Analysis of Editing News Headlines for Boosting User Engagement [14.829079057399838]
We analyze media outlets' current practices using a data-driven approach.
We build a parallel corpus of original news articles and their corresponding tweets that eight media outlets shared.
Then, we explore how those media edited tweets against original headlines and the effects of such changes.
arXiv Detail & Related papers (2020-09-17T06:39:49Z) - A Blast From the Past: Personalizing Predictions of Video-Induced Emotions using Personal Memories as Context [5.1314912554605066]
We show that automatic analysis of text describing viewers' video-triggered memories can account for variation in their emotional responses.
We discuss the relevance of these findings for improving on state-of-the-art approaches to automated affective video analysis in personalized contexts.
arXiv Detail & Related papers (2020-08-27T13:06:10Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z) - Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition [19.93779132095822]
We argue that jointly learning intertwined features from these two information channels is beneficial.
We propose a single stream architecture able to do so, thanks to the addition of a self-supervised motion prediction block.
Experiments on several publicly available databases show the power of our approach.
arXiv Detail & Related papers (2020-02-10T17:51:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.