Knowledge-enhanced Multi-perspective Video Representation Learning for
Scene Recognition
- URL: http://arxiv.org/abs/2401.04354v1
- Date: Tue, 9 Jan 2024 04:37:10 GMT
- Title: Knowledge-enhanced Multi-perspective Video Representation Learning for
Scene Recognition
- Authors: Xuzheng Yu, Chen Jiang, Wei Zhang, Tian Gan, Linlin Chao, Jianan Zhao,
Yuan Cheng, Qingpei Guo, Wei Chu
- Abstract summary: We address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos.
Most existing works identify scenes for videos only from visual or textual information in a temporal perspective.
We propose a novel two-stream framework to model video representations from multiple perspectives.
- Score: 33.800842679024164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the explosive growth of video data in real-world applications, a
comprehensive representation of videos becomes increasingly important. In this
paper, we address the problem of video scene recognition, whose goal is to
learn a high-level video representation to classify scenes in videos. Due to
the diversity and complexity of video contents in realistic scenarios, this
task remains a challenge. Most existing works identify scenes for videos only
from visual or textual information in a temporal perspective, ignoring the
valuable information hidden in single frames, while several earlier studies
only recognize scenes for separate images in a non-temporal perspective. We
argue that these two perspectives are both meaningful for this task and
complementary to each other; meanwhile, externally introduced knowledge can
also promote the comprehension of videos. We propose a novel two-stream
framework to model video representations from multiple perspectives, i.e.
temporal and non-temporal perspectives, and integrate the two perspectives in
an end-to-end manner by self-distillation. Besides, we design a
knowledge-enhanced feature fusion and label prediction method that contributes
to naturally introducing knowledge into the task of video scene recognition.
Experiments conducted on a real-world dataset demonstrate the effectiveness of
our proposed method.
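The abstract gives only a high-level description of the architecture, so the following PyTorch-style sketch is purely illustrative: it shows one plausible way to wire a two-perspective model with knowledge-conditioned fusion and a self-distillation term between streams. Every concrete choice here (a GRU for the temporal stream, mean pooling for the non-temporal stream, concatenating an external knowledge vector, a KL-based distillation loss, all names and dimensions) is an assumption, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamSceneModel(nn.Module):
    """Hypothetical two-perspective scene model: a temporal stream over the
    frame sequence and a non-temporal stream over single frames, each fused
    with an external knowledge embedding. Sizes and modules are illustrative."""

    def __init__(self, feat_dim=512, knowledge_dim=128, num_scenes=50):
        super().__init__()
        # Temporal perspective: aggregate per-frame features across time.
        self.temporal_stream = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Non-temporal perspective: score frames independently, then pool.
        self.frame_stream = nn.Linear(feat_dim, feat_dim)
        # Knowledge-enhanced fusion layer shared by both streams.
        self.fuse = nn.Linear(feat_dim + knowledge_dim, feat_dim)
        self.temporal_head = nn.Linear(feat_dim, num_scenes)
        self.frame_head = nn.Linear(feat_dim, num_scenes)

    def forward(self, frame_feats, knowledge_vec):
        # frame_feats: (batch, num_frames, feat_dim) pre-extracted visual features
        # knowledge_vec: (batch, knowledge_dim) externally introduced knowledge
        _, h = self.temporal_stream(frame_feats)
        temporal_repr = h.squeeze(0)                              # temporal summary
        frame_repr = self.frame_stream(frame_feats).mean(dim=1)   # non-temporal pooling
        temporal_logits = self.temporal_head(
            self.fuse(torch.cat([temporal_repr, knowledge_vec], dim=-1)))
        frame_logits = self.frame_head(
            self.fuse(torch.cat([frame_repr, knowledge_vec], dim=-1)))
        return temporal_logits, frame_logits


def self_distillation_loss(temporal_logits, frame_logits, labels, alpha=0.5, tau=2.0):
    """Supervised loss on both streams plus a KL term that lets one perspective
    teach the other -- one common reading of end-to-end self-distillation."""
    ce = F.cross_entropy(temporal_logits, labels) + F.cross_entropy(frame_logits, labels)
    student = F.log_softmax(temporal_logits / tau, dim=-1)
    teacher = F.softmax(frame_logits / tau, dim=-1).detach()
    kd = F.kl_div(student, teacher, reduction="batchmean") * tau * tau
    return ce + alpha * kd


# Shape check with random inputs (4 videos, 16 frames each):
# model = TwoStreamSceneModel()
# logits_t, logits_f = model(torch.randn(4, 16, 512), torch.randn(4, 128))
# loss = self_distillation_loss(logits_t, logits_f, torch.randint(0, 50, (4,)))
```

Here distillation flows in a single direction (the non-temporal stream acts as teacher) for brevity; a symmetric or alternating scheme would be an equally plausible interpretation of the paper's mutual integration of the two perspectives.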
Related papers
- Deep video representation learning: a survey [4.9589745881431435]
We review recent sequential feature learning methods for visual data and compare their pros and cons for general video analysis.
Building effective features for videos is a fundamental problem in computer vision tasks involving video analysis and understanding.
arXiv Detail & Related papers (2024-05-10T16:20:11Z)
- NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos [51.409547544747284]
NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations.
We conduct a series of analyses to gain deeper insights into this task.
We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
arXiv Detail & Related papers (2023-08-23T14:25:22Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- What You Say Is What You Show: Visual Narration Detection in Instructional Videos [108.77600799637172]
We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
arXiv Detail & Related papers (2023-01-05T21:43:19Z)
- Contextual Explainable Video Representation: Human Perception-based Understanding [10.172332586182792]
We discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment.
We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception based-contextual representation in video understanding.
arXiv Detail & Related papers (2022-12-12T19:29:07Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
The Motion-Contrastive Perception Network (MCPNet) consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis [1.6181085766811525]
We propose a multimodal structure to obtain state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z)
- Space-time Neural Irradiance Fields for Free-Viewpoint Video [54.436478702701244]
We present a method that learns a neural irradiance field for dynamic scenes from a single video.
Our learned representation enables free-view rendering of the input video.
arXiv Detail & Related papers (2020-11-25T18:59:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.