Knowledge-enhanced Multi-perspective Video Representation Learning for
Scene Recognition
- URL: http://arxiv.org/abs/2401.04354v1
- Date: Tue, 9 Jan 2024 04:37:10 GMT
- Title: Knowledge-enhanced Multi-perspective Video Representation Learning for
Scene Recognition
- Authors: Xuzheng Yu, Chen Jiang, Wei Zhang, Tian Gan, Linlin Chao, Jianan Zhao,
Yuan Cheng, Qingpei Guo, Wei Chu
- Abstract summary: We address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos.
Most existing works identify scenes for videos only from visual or textual information in a temporal perspective.
We propose a novel two-stream framework to model video representations from multiple perspectives.
- Score: 33.800842679024164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the explosive growth of video data in real-world applications, a
comprehensive representation of videos becomes increasingly important. In this
paper, we address the problem of video scene recognition, whose goal is to
learn a high-level video representation to classify scenes in videos. Due to
the diversity and complexity of video contents in realistic scenarios, this
task remains a challenge. Most existing works identify scenes for videos only
from visual or textual information in a temporal perspective, ignoring the
valuable information hidden in single frames, while several earlier studies
only recognize scenes for separate images in a non-temporal perspective. We
argue that these two perspectives are both meaningful for this task and
complementary to each other; meanwhile, externally introduced knowledge can
also promote the comprehension of videos. We propose a novel two-stream
framework to model video representations from multiple perspectives, i.e.
temporal and non-temporal perspectives, and integrate the two perspectives in
an end-to-end manner by self-distillation. Besides, we design a
knowledge-enhanced feature fusion and label prediction method that contributes
to naturally introducing knowledge into the task of video scene recognition.
Experiments conducted on a real-world dataset demonstrate the effectiveness of
our proposed method.
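The abstract gives only a high-level description of the architecture, so the following PyTorch-style sketch is purely illustrative: it shows one plausible way to wire a two-perspective model with knowledge-conditioned fusion and a self-distillation term between streams. Every concrete choice here (a GRU for the temporal stream, mean pooling for the non-temporal stream, concatenating an external knowledge vector, a KL-based distillation loss, all names and dimensions) is an assumption, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamSceneModel(nn.Module):
    """Hypothetical two-perspective scene model: a temporal stream over the
    frame sequence and a non-temporal stream over single frames, each fused
    with an external knowledge embedding. Sizes and modules are illustrative."""

    def __init__(self, feat_dim=512, knowledge_dim=128, num_scenes=50):
        super().__init__()
        # Temporal perspective: aggregate per-frame features across time.
        self.temporal_stream = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Non-temporal perspective: score frames independently, then pool.
        self.frame_stream = nn.Linear(feat_dim, feat_dim)
        # Knowledge-enhanced fusion layer shared by both streams.
        self.fuse = nn.Linear(feat_dim + knowledge_dim, feat_dim)
        self.temporal_head = nn.Linear(feat_dim, num_scenes)
        self.frame_head = nn.Linear(feat_dim, num_scenes)

    def forward(self, frame_feats, knowledge_vec):
        # frame_feats: (batch, num_frames, feat_dim) pre-extracted visual features
        # knowledge_vec: (batch, knowledge_dim) externally introduced knowledge
        _, h = self.temporal_stream(frame_feats)
        temporal_repr = h.squeeze(0)                              # temporal summary
        frame_repr = self.frame_stream(frame_feats).mean(dim=1)   # non-temporal pooling
        temporal_logits = self.temporal_head(
            self.fuse(torch.cat([temporal_repr, knowledge_vec], dim=-1)))
        frame_logits = self.frame_head(
            self.fuse(torch.cat([frame_repr, knowledge_vec], dim=-1)))
        return temporal_logits, frame_logits


def self_distillation_loss(temporal_logits, frame_logits, labels, alpha=0.5, tau=2.0):
    """Supervised loss on both streams plus a KL term that lets one perspective
    teach the other -- one common reading of end-to-end self-distillation."""
    ce = F.cross_entropy(temporal_logits, labels) + F.cross_entropy(frame_logits, labels)
    student = F.log_softmax(temporal_logits / tau, dim=-1)
    teacher = F.softmax(frame_logits / tau, dim=-1).detach()
    kd = F.kl_div(student, teacher, reduction="batchmean") * tau * tau
    return ce + alpha * kd


# Shape check with random inputs (4 videos, 16 frames each):
# model = TwoStreamSceneModel()
# logits_t, logits_f = model(torch.randn(4, 16, 512), torch.randn(4, 128))
# loss = self_distillation_loss(logits_t, logits_f, torch.randint(0, 50, (4,)))
```

Here distillation flows in a single direction (the non-temporal stream acts as teacher) for brevity; a symmetric or alternating scheme would be an equally plausible interpretation of the paper's mutual integration of the two perspectives.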
Related papers
- Deep video representation learning: a survey [4.9589745881431435]
We review recent sequential feature learning methods for visual data and compare their pros and cons for general video analysis.
Building effective features for videos is a fundamental problem in computer vision tasks involving video analysis and understanding.
arXiv Detail & Related papers (2024-05-10T16:20:11Z)
- NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos [51.409547544747284]
NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations.
We conduct a series of analyses to gain deeper insights into this task.
We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
arXiv Detail & Related papers (2023-08-23T14:25:22Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- What You Say Is What You Show: Visual Narration Detection in Instructional Videos [108.77600799637172]
We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
arXiv Detail & Related papers (2023-01-05T21:43:19Z)
- Contextual Explainable Video Representation: Human Perception-based Understanding [10.172332586182792]
We discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment.
We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception based-contextual representation in video understanding.
arXiv Detail & Related papers (2022-12-12T19:29:07Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
The Motion-Contrastive Perception Network (MCPNet) consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis [1.6181085766811525]
We propose a multimodal structure to obtain state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z)
- Space-time Neural Irradiance Fields for Free-Viewpoint Video [54.436478702701244]
We present a method that learns a neural irradiance field for dynamic scenes from a single video.
Our learned representation enables free-view rendering of the input video.
arXiv Detail & Related papers (2020-11-25T18:59:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.