NPF-200: A Multi-Modal Eye Fixation Dataset and Method for
Non-Photorealistic Videos
- URL: http://arxiv.org/abs/2308.12163v1
- Date: Wed, 23 Aug 2023 14:25:22 GMT
- Title: NPF-200: A Multi-Modal Eye Fixation Dataset and Method for
Non-Photorealistic Videos
- Authors: Ziyu Yang, Sucheng Ren, Zongwei Wu, Nanxuan Zhao, Junle Wang, Jing
Qin, Shengfeng He
- Abstract summary: NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations.
We conduct a series of analyses to gain deeper insights into this task.
We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
- Score: 51.409547544747284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-photorealistic videos are in demand with the wave of the metaverse, but
lack sufficient research studies. This work aims to take a step forward to
understand how humans perceive non-photorealistic videos with eye fixation
(i.e., saliency detection), which is critical for enhancing media production,
artistic design, and game user experience. To fill the gap of a missing
suitable dataset for this research line, we present NPF-200, the first
large-scale multi-modal dataset of purely non-photorealistic videos with eye
fixations. Our dataset has three characteristics: 1) it contains soundtracks,
which are essential according to vision and psychological studies; 2) it
includes diverse semantic content and high-quality videos; 3) it has
rich motions across and within videos. We conduct a series of analyses to gain
deeper insights into this task and compare several state-of-the-art methods to
explore the gap between natural images and non-photorealistic data.
Additionally, as the human attention system tends to extract visual and audio
features with different frequencies, we propose a universal frequency-aware
multi-modal non-photorealistic saliency detection model called NPSNet,
demonstrating state-of-the-art performance on our task. The results uncover
strengths and weaknesses of multi-modal network design and multi-domain
training, opening up promising directions for future work. Our dataset and
code can be found at https://github.com/Yangziyu/NPF200.
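Since the task centers on eye fixations and the saliency maps derived from them, the minimal Python sketch below illustrates the common practice of converting recorded fixation coordinates into a dense ground-truth saliency map and scoring a prediction with the standard NSS metric. This is an illustrative assumption about typical preprocessing, not code from the NPF-200 repository; the function names, the Gaussian sigma, and the (x, y) pixel convention are hypothetical.

import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency_map(fixations, height, width, sigma=25.0):
    """Rasterize (x, y) fixation points and blur them into a dense saliency map.

    fixations: iterable of (x, y) pixel coordinates from the eye tracker
    height, width: resolution of the video frame
    sigma: Gaussian blur in pixels, a rough proxy for one degree of visual
           angle (the exact value depends on screen size and viewing distance)
    """
    fixation_map = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            fixation_map[yi, xi] += 1.0
    saliency = gaussian_filter(fixation_map, sigma=sigma)
    if saliency.max() > 0:
        saliency /= saliency.max()  # normalize to [0, 1]
    return saliency

def nss(prediction, fixations):
    """Normalized Scanpath Saliency: mean of the standardized prediction
    sampled at the discrete fixation locations."""
    z = (prediction - prediction.mean()) / (prediction.std() + 1e-8)
    values = [z[int(round(y)), int(round(x))]
              for x, y in fixations
              if 0 <= int(round(y)) < z.shape[0] and 0 <= int(round(x)) < z.shape[1]]
    return float(np.mean(values)) if values else 0.0

Frame-level maps built this way can serve as supervision targets or evaluation references when comparing saliency models on non-photorealistic data.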
Related papers
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding [59.020450264301026]
VideoLLaMA3 is a more advanced multimodal foundation model for image and video understanding.
VideoLLaMA3 has four training stages: Vision Adaptation, Vision-Language Alignment, Fine-tuning, and Video-centric Fine-tuning.
VideoLLaMA3 achieves compelling performances in both image and video understanding benchmarks.
arXiv Detail & Related papers (2025-01-22T18:59:46Z)
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation [70.68566282567207]
VisionReward is a fine-grained and multi-dimensional reward model.
We decompose human preferences in images and videos into multiple dimensions.
Based on VisionReward, we develop a multi-objective preference learning algorithm.
arXiv Detail & Related papers (2024-12-30T16:24:09Z)
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training with long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition [33.800842679024164]
We address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos.
Most existing works identify scenes in videos using only visual or textual information from a temporal perspective.
We propose a novel two-stream framework to model video representations from multiple perspectives.
arXiv Detail & Related papers (2024-01-09T04:37:10Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving [10.3540046389057]
This work presents a multi-task visual perception network on unrectified fisheye images.
It consists of six primary tasks necessary for an autonomous driving system.
We demonstrate that the jointly trained model performs better than the respective single task versions.
arXiv Detail & Related papers (2021-02-15T10:46:24Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)