Video SemNet: Memory-Augmented Video Semantic Network
- URL: http://arxiv.org/abs/2011.10909v1
- Date: Sun, 22 Nov 2020 01:36:37 GMT
- Title: Video SemNet: Memory-Augmented Video Semantic Network
- Authors: Prashanth Vijayaraghavan, Deb Roy
- Abstract summary: We propose a machine learning approach to capture the narrative elements in movies by bridging the gap between the low-level data representations and semantic aspects of the visual medium.
We present a Memory-Augmented Video Semantic Network, called Video SemNet, to encode the semantic descriptors and learn an embedding for the video.
We demonstrate that our model predicts genres and IMDB ratings with weighted F-1 scores of 0.72 and 0.63, respectively.
- Score: 14.64546899992196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stories are a very compelling medium to convey ideas, experiences, and
social and cultural values. Narrative is a specific manifestation of the story that turns
it into knowledge for the audience. In this paper, we propose a machine
learning approach to capture the narrative elements in movies by bridging the
gap between the low-level data representations and semantic aspects of the
visual medium. We present a Memory-Augmented Video Semantic Network, called
Video SemNet, to encode the semantic descriptors and learn an embedding for the
video. The model employs two main components: (i) a neural semantic learner
that learns latent embeddings of semantic descriptors and (ii) a memory module
that retains and memorizes specific semantic patterns from the video. We
evaluate the video representations obtained from variants of our model on two
tasks: (a) genre prediction and (b) IMDB rating prediction. We demonstrate that
our model predicts genres and IMDB ratings with weighted F-1 scores of 0.72 and
0.63, respectively. The results are indicative of the
representational power of our model and the ability of such representations to
measure audience engagement.
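
The abstract describes the architecture only at a high level; as a rough illustration, here is a minimal PyTorch sketch of the two components it names, a neural semantic learner over per-segment semantic descriptors and a soft-attention memory module, pooled into a video embedding with heads for the two evaluation tasks. All module choices, dimensions, and the memory read scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VideoSemNetSketch(nn.Module):
    """Illustrative sketch of a memory-augmented video semantic network.

    A recurrent semantic learner embeds per-segment semantic descriptors,
    a learned memory of semantic patterns is read with soft attention, and
    the fused states are pooled into a video embedding with heads for the
    two tasks evaluated in the paper (genre and binned IMDB rating).
    Hyper-parameters and the memory addressing scheme are assumptions.
    """

    def __init__(self, descriptor_dim=300, hidden_dim=256,
                 memory_slots=64, num_genres=20, num_rating_bins=10):
        super().__init__()
        # (i) neural semantic learner: contextual embeddings of descriptors
        self.semantic_learner = nn.GRU(descriptor_dim, hidden_dim,
                                       batch_first=True, bidirectional=True)
        self.project = nn.Linear(2 * hidden_dim, hidden_dim)
        # (ii) memory module: slots intended to retain semantic patterns
        self.memory = nn.Parameter(0.02 * torch.randn(memory_slots, hidden_dim))
        # task heads
        self.genre_head = nn.Linear(hidden_dim, num_genres)        # multi-label
        self.rating_head = nn.Linear(hidden_dim, num_rating_bins)  # binned rating

    def forward(self, descriptors):
        # descriptors: (batch, num_segments, descriptor_dim)
        states, _ = self.semantic_learner(descriptors)
        states = self.project(states)                            # (B, T, H)
        # soft memory read: each segment state attends over memory slots
        attn = torch.softmax(states @ self.memory.t(), dim=-1)   # (B, T, M)
        memory_read = attn @ self.memory                         # (B, T, H)
        # fuse segment states with retrieved memory, pool over time
        video_embedding = (states + memory_read).mean(dim=1)     # (B, H)
        return {
            "embedding": video_embedding,
            "genre_logits": self.genre_head(video_embedding),
            "rating_logits": self.rating_head(video_embedding),
        }
```

In such a setup, genre prediction would typically be trained as multi-label classification and the IMDB rating as a binned classification or regression target, which is compatible with the weighted F-1 scores reported above.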
Related papers
- Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences.
To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets.
We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
- Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model [10.742625681420279]
Based on a large-scale pre-trained Vision Language Model (VLM), our model learns and improves visual, textual, and numerical representations of patient gait videos.
Results demonstrate that our model not only significantly outperforms state-of-the-art methods in video-based classification tasks but also adeptly decodes the learned class-specific text features into natural language descriptions.
arXiv Detail & Related papers (2024-03-20T17:03:38Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of video-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- Adversarial Memory Networks for Action Prediction [95.09968654228372]
Action prediction aims to infer the forthcoming human action from partially observed videos.
We propose adversarial memory networks (AMemNet) to generate the "full video" feature conditioned on a partial video query.
arXiv Detail & Related papers (2021-12-18T08:16:21Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model [26.78064626111014]
In building automatic speech recognition systems, we can exploit the contextual information provided by video metadata.
We first use an attention-based method to extract contextual vector representations of video metadata and use these representations as part of the inputs to a neural language model.
Second, we propose a hybrid pointer network approach to explicitly interpolate the probabilities of words that occur in the metadata, as sketched below.
arXiv Detail & Related papers (2020-05-15T07:47:33Z)
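
As a rough illustration of the pointer-style interpolation described in the last entry, the sketch below blends a language-model distribution with a copy distribution over metadata tokens. The function name, tensor shapes, and the sigmoid gate are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def interpolate_with_metadata_pointer(lm_logits, ptr_scores,
                                      metadata_token_ids, gate_logit):
    """Blend LM word probabilities with a pointer distribution over metadata.

    lm_logits:          (batch, vocab_size) scores from the neural language model
    ptr_scores:         (batch, num_meta)   attention scores over metadata tokens
    metadata_token_ids: (batch, num_meta)   long tensor of vocabulary ids
    gate_logit:         (batch, 1)          controls how much mass the LM keeps
    """
    p_lm = F.softmax(lm_logits, dim=-1)
    p_ptr_meta = F.softmax(ptr_scores, dim=-1)
    # scatter the pointer mass back onto the full vocabulary
    p_ptr = torch.zeros_like(p_lm).scatter_add_(-1, metadata_token_ids, p_ptr_meta)
    g = torch.sigmoid(gate_logit)
    # final word distribution used when rescoring lattice hypotheses
    return g * p_lm + (1.0 - g) * p_ptr
```

Words that appear in the video metadata thereby receive extra probability mass during rescoring, even when the base language model assigns them little.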