Contextual Explainable Video Representation: Human Perception-based Understanding
- URL: http://arxiv.org/abs/2212.06206v1
- Date: Mon, 12 Dec 2022 19:29:07 GMT
- Title: Contextual Explainable Video Representation: Human Perception-based Understanding
- Authors: Khoa Vo, Kashu Yamazaki, Phong X. Nguyen, Phat Nguyen, Khoa Luu, Ngan
Le
- Abstract summary: We discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment.
We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding.
- Score: 10.172332586182792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video understanding is a growing field and a subject of intense research, which includes many interesting tasks for understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, due to the long and complicated temporal structure of unconstrained videos. Different from existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observed, humans typically perceive a video through the interactions between three main factors, i.e., the actors, the relevant objects, and the surrounding environment. Therefore, it is crucial to design a contextual, explainable video representation extraction method that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding. Source code is publicly available at
https://github.com/UARK-AICV/Video_Representation.
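For illustration only, the minimal sketch below shows one way such a three-factor contextual representation could be fused (actors attending to relevant objects and to the surrounding environment); the module names, dimensions, and the simple attention scheme are assumptions of this sketch, not the architecture released in the repository linked above.
```python
# Illustrative sketch only: fuse actor, object, and environment features into a
# contextual clip representation. All names and dimensions are hypothetical.
import torch
import torch.nn as nn


class ContextualFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Actors attend to relevant objects, then to the environment context.
        self.actor_object_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.actor_env_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, actors, objects, environment):
        # actors:      (B, Na, dim)  per-actor features
        # objects:     (B, No, dim)  per-object features
        # environment: (B, 1,  dim)  global scene feature
        actor_obj, _ = self.actor_object_attn(actors, objects, objects)
        actor_env, _ = self.actor_env_attn(actors, environment, environment)
        fused = torch.cat([actors, actor_obj, actor_env], dim=-1)
        # Pool over actors to obtain a single contextual clip representation.
        return self.proj(fused).mean(dim=1)  # (B, dim)


if __name__ == "__main__":
    B, Na, No, dim = 2, 3, 5, 512
    model = ContextualFusion(dim)
    clip_repr = model(torch.randn(B, Na, dim),
                      torch.randn(B, No, dim),
                      torch.randn(B, 1, dim))
    print(clip_repr.shape)  # torch.Size([2, 512])
```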
Related papers
- A Survey of Video Datasets for Grounded Event Understanding [34.11140286628736]
Multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding.
We survey 105 video datasets that require event understanding capability.
arXiv Detail & Related papers (2024-06-14T00:36:55Z) - Deep video representation learning: a survey [4.9589745881431435]
We review recent sequential feature learning methods for visual data and compare their pros and cons for general video analysis.
Building effective features for videos is a fundamental problem in computer vision tasks involving video analysis and understanding.
arXiv Detail & Related papers (2024-05-10T16:20:11Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens (a hypothetical sketch of this token idea appears after this list).
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Knowledge-enhanced Multi-perspective Video Representation Learning for
Scene Recognition [33.800842679024164]
We address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos.
Most existing works identify scenes for videos only from visual or textual information in a temporal perspective.
We propose a novel two-stream framework to model video representations from multiple perspectives.
arXiv Detail & Related papers (2024-01-09T04:37:10Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - Weakly Supervised Human-Object Interaction Detection in Video via
Contrastive Spatiotemporal Regions [81.88294320397826]
A system does not know which human-object interactions are present in a video, nor the actual locations of the human and object.
We introduce a dataset comprising over 6.5k videos with human-object interactions that have been curated from sentence captions.
On our video dataset, we demonstrate improved performance over weakly supervised baselines adapted to our annotations.
arXiv Detail & Related papers (2021-10-07T15:30:18Z) - Highlight Timestamp Detection Model for Comedy Videos via Multimodal
Sentiment Analysis [1.6181085766811525]
We propose a multimodal structure to obtain state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
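As a purely hypothetical illustration of the time and box tokens mentioned in the OmniVid entry above, the sketch below quantizes timestamps and bounding-box coordinates into discrete text-style tokens so that task outputs become token sequences; the bin counts and token names are assumptions, not OmniVid's actual vocabulary.
```python
# Hypothetical illustration of time/box tokenization for unified video outputs.
def time_token(t_sec: float, duration: float, bins: int = 100) -> str:
    """Map a timestamp to one of `bins` discrete time tokens."""
    idx = min(int(t_sec / duration * bins), bins - 1)
    return f"<time_{idx}>"


def box_tokens(box, width, height, bins: int = 100):
    """Map an (x1, y1, x2, y2) box to four discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / width, y1 / height, x2 / width, y2 / height]
    return [f"<box_{min(int(v * bins), bins - 1)}>" for v in norm]


if __name__ == "__main__":
    # A localization target at 12.4s-15.0s in a 60s video:
    print(time_token(12.4, 60.0), time_token(15.0, 60.0))  # <time_20> <time_25>
    print(box_tokens((32, 48, 256, 480), 640, 480))         # four <box_*> tokens
```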