CLUE: Contextualised Unified Explainable Learning of User Engagement in
Video Lectures
- URL: http://arxiv.org/abs/2201.05651v1
- Date: Fri, 14 Jan 2022 19:51:06 GMT
- Title: CLUE: Contextualised Unified Explainable Learning of User Engagement in
Video Lectures
- Authors: Sujit Roy, Gnaneswara Rao Gorle, Vishal Gaur, Haider Raza, Shoaib
Jameel
- Abstract summary: We propose a new unified model, CLUE, which learns from the features extracted from public online teaching videos.
Our model exploits various multi-modal features to model the complexity of language, context information, and the textual emotion of the delivered content.
- Score: 6.25256391074865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting contextualised engagement in videos is a long-standing
problem that is commonly tackled by exploiting the number of views or the
associated likes using different computational methods. The recent decade has
seen a boom in online learning resources, and during the pandemic there has
been an exponential rise in online teaching videos without much quality
control. The quality of the content could be improved if creators received
constructive feedback on their videos. Employing an army of domain-expert
volunteers to provide feedback on the videos might not scale. As a result,
there has been a steep rise in computational methods that predict a user
engagement score, i.e., an indication of the level to which a user would tend
to engage with the content. A drawback of current methods is that they model
the various features separately, in a cascaded approach that is prone to error
propagation. Besides, most of them do not explain how the creator could
improve the content. In this paper, we propose a new unified model, CLUE, for
the educational domain, which learns from features extracted from freely
available public online teaching videos and provides explainable feedback on
the video along with a user engagement score. Given the complexity of the
task, our unified framework employs different pre-trained models working
together as an ensemble of classifiers. Our model exploits various multi-modal
features to model the complexity of language, context-agnostic information,
the textual emotion of the delivered content, animation, the speaker's pitch,
and speech emotions. Under a transfer-learning setup, the overall model, in
the unified space, is fine-tuned for downstream applications.
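As a rough illustration of the unified multi-modal setup described above, the sketch below fuses text, audio and video features through a single late-fusion head that outputs an engagement score. It is not the authors' implementation: the feature dimensions, the linear projections standing in for the frozen pre-trained encoders, and the fusion head are all assumptions.

```python
# A minimal late-fusion sketch, NOT the CLUE implementation; dimensions,
# projections and the regression head are illustrative assumptions.
import torch
import torch.nn as nn

class MultiModalEngagementModel(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, video_dim=512, hidden=256):
        super().__init__()
        # Per-modality projections standing in for pre-trained encoders
        # (language model for transcripts, audio model for pitch/speech
        # emotion, visual model for animation cues).
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        # Unified head fine-tuned for the downstream engagement task.
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_feat, audio_feat, video_feat):
        fused = torch.cat(
            [self.text_proj(text_feat),
             self.audio_proj(audio_feat),
             self.video_proj(video_feat)],
            dim=-1,
        )
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # score in [0, 1]

model = MultiModalEngagementModel()
scores = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
print(scores.shape)  # torch.Size([4])
```

In the actual model the per-modality inputs would come from the ensemble of pre-trained classifiers named in the abstract, fine-tuned jointly in the unified space under transfer learning.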
Related papers
- ExpertAF: Expert Actionable Feedback from Video [81.46431188306397]
We introduce a novel method to generate actionable feedback from video of a person doing a physical activity.
Our method takes a video demonstration and its accompanying 3D body pose and generates expert commentary.
Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching.
arXiv Detail & Related papers (2024-08-01T16:13:07Z)
- Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning [0.0]
Video Annotator (VA) is a framework for annotating, managing, and iterating on video classification datasets.
VA allows for a continuous annotation process, seamlessly integrating data collection and model training.
VA achieves a median 6.8 point improvement in Average Precision relative to the most competitive baseline.
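For intuition, a generic uncertainty-sampling loop of the kind an active-learning annotation framework might run is sketched below; the classifier, the selection rule and the synthetic clip embeddings are assumptions for illustration, not details from the Video Annotator paper.

```python
# A toy active-learning loop: train, pick the least confident clips for
# human annotation, retrain. Data and model choices are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(1000, 512))      # placeholder clip embeddings
X_labeled = rng.normal(size=(20, 512))
y_labeled = np.array([0, 1] * 10)               # small seed set of labels

for _ in range(3):
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    probs = clf.predict_proba(X_unlabeled)[:, 1]
    # Clips closest to the decision boundary are sent to the annotator.
    query = np.argsort(np.abs(probs - 0.5))[:10]
    new_labels = rng.integers(0, 2, size=10)    # stand-in for human labels
    X_labeled = np.vstack([X_labeled, X_unlabeled[query]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_unlabeled = np.delete(X_unlabeled, query, axis=0)
```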
arXiv Detail & Related papers (2024-02-09T17:19:05Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
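A minimal sketch of CLIP-score-guided frame selection is shown below, assuming a HuggingFace CLIP checkpoint, random stand-in frames and a made-up question; VaQuitA's actual ranking procedure and its Video Perceiver / Visual-Query Transformer are not reproduced here.

```python
# Rank frames by CLIP image-text similarity and keep the top-k; a sketch of
# score-guided sampling, not VaQuitA's implementation.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder "frames"; in practice these come from decoding the video.
frames = [Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(16)]
question = "What is the person in the video doing?"

inputs = processor(text=[question], images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image.squeeze(-1)  # one score per frame

keep = torch.topk(sims, k=4).indices.tolist()  # frames most relevant to the text
selected_frames = [frames[i] for i in keep]
```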
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Language as the Medium: Multimodal Video Classification through text only [3.744589644319257]
We propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information.
Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2.
Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks.
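To illustrate the text-only classification step, the sketch below zero-shot classifies a hand-written video description with an off-the-shelf NLI model; the description, label set and classifier are assumptions, not the paper's GPT-3.5/Llama2 pipeline.

```python
# Classify a video purely from its generated textual description; the
# description and labels here are invented for illustration.
from transformers import pipeline

description = ("A person on a grass field throws a frisbee to a dog, which "
               "jumps and catches it while a crowd cheers in the background.")
action_labels = ["walking the dog", "playing frisbee", "mowing the lawn"]

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(description, candidate_labels=action_labels)
print(result["labels"][0])  # most likely action, e.g. "playing frisbee"
```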
arXiv Detail & Related papers (2023-09-19T17:32:21Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
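As a generic illustration of contrastive multimodal pre-training, the snippet below computes a symmetric InfoNCE-style loss between batched video and text embeddings; VICTOR's actual proxy tasks and architecture go well beyond this.

```python
# Symmetric video-text contrastive loss over a batch; matching pairs sit on
# the diagonal of the similarity matrix. Illustrative only.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```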
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
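For intuition only, a bare-bones sparse dictionary learning step over paired video and caption embeddings might look like the sketch below; the embeddings, the concatenation-based pairing and the scikit-learn solver are assumptions, not the paper's neuro-symbolic formulation.

```python
# Toy sparse coding over video-caption pairs; a stand-in for relation
# learning, not the paper's method.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
video_emb = rng.normal(size=(200, 64))    # placeholder video embeddings
text_emb = rng.normal(size=(200, 64))     # placeholder caption embeddings
pairs = np.hstack([video_emb, text_emb])  # one row per video-caption pair

# Each pair is represented as a sparse code over a small set of learnt atoms.
dico = DictionaryLearning(n_components=16, alpha=1.0, max_iter=50, random_state=0)
codes = dico.fit_transform(pairs)
print(codes.shape)  # (200, 16)
```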
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- VLEngagement: A Dataset of Scientific Video Lectures for Evaluating Population-based Engagement [23.078055803229912]
Video lectures have become one of the primary modalities to impart knowledge to the masses in the current digital age.
There is still an important need for data and research aimed at understanding learner engagement with scientific video lectures.
This paper introduces VLEngagement, a novel dataset that consists of content-based and video-specific features extracted from publicly available scientific video lectures.
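As a simple baseline for the kind of population-based engagement prediction this dataset supports, the sketch below fits a gradient-boosted regressor on synthetic stand-ins for the content-based features; the feature set, labels and model choice are assumptions, not the benchmark setup from the paper.

```python
# Feature-based engagement regression on synthetic data; metrics here are
# meaningless and only show the shape of the pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # placeholder content-based features
y = rng.uniform(0, 1, size=500)     # placeholder engagement labels in [0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
print("Held-out R^2:", model.score(X_test, y_test))
```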
arXiv Detail & Related papers (2020-11-02T14:20:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information (including all content) and is not responsible for any consequences arising from its use.