NPF-200: A Multi-Modal Eye Fixation Dataset and Method for
Non-Photorealistic Videos
- URL: http://arxiv.org/abs/2308.12163v1
- Date: Wed, 23 Aug 2023 14:25:22 GMT
- Title: NPF-200: A Multi-Modal Eye Fixation Dataset and Method for
Non-Photorealistic Videos
- Authors: Ziyu Yang, Sucheng Ren, Zongwei Wu, Nanxuan Zhao, Junle Wang, Jing
Qin, Shengfeng He
- Abstract summary: NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations.
We conduct a series of analyses to gain deeper insights into this task.
We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
- Score: 51.409547544747284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-photorealistic videos are in demand with the wave of the metaverse, but
lack sufficient research studies. This work aims to take a step forward to
understand how humans perceive non-photorealistic videos with eye fixation
(i.e., saliency detection), which is critical for enhancing media production,
artistic design, and game user experience. To fill the gap of a missing
suitable dataset for this research line, we present NPF-200, the first
large-scale multi-modal dataset of purely non-photorealistic videos with eye
fixations. Our dataset has three characteristics: 1) it contains soundtracks,
which are essential according to vision and psychological studies; 2) it
includes diverse semantic content and high-quality videos; 3) it has
rich motions across and within videos. We conduct a series of analyses to gain
deeper insights into this task and compare several state-of-the-art methods to
explore the gap between natural images and non-photorealistic data.
Additionally, as the human attention system tends to extract visual and audio
features with different frequencies, we propose a universal frequency-aware
multi-modal non-photorealistic saliency detection model called NPSNet,
demonstrating state-of-the-art performance on our task. The results uncover
strengths and weaknesses of multi-modal network design and multi-domain
training, opening up promising directions for future work. Our dataset and
code can be found at https://github.com/Yangziyu/NPF200.
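Since the task centers on eye fixations and the saliency maps derived from them, the minimal Python sketch below illustrates the common practice of converting recorded fixation coordinates into a dense ground-truth saliency map and scoring a prediction with the standard NSS metric. This is an illustrative assumption about typical preprocessing, not code from the NPF-200 repository; the function names, the Gaussian sigma, and the (x, y) pixel convention are hypothetical.

import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency_map(fixations, height, width, sigma=25.0):
    """Rasterize (x, y) fixation points and blur them into a dense saliency map.

    fixations: iterable of (x, y) pixel coordinates from the eye tracker
    height, width: resolution of the video frame
    sigma: Gaussian blur in pixels, a rough proxy for one degree of visual
           angle (the exact value depends on screen size and viewing distance)
    """
    fixation_map = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            fixation_map[yi, xi] += 1.0
    saliency = gaussian_filter(fixation_map, sigma=sigma)
    if saliency.max() > 0:
        saliency /= saliency.max()  # normalize to [0, 1]
    return saliency

def nss(prediction, fixations):
    """Normalized Scanpath Saliency: mean of the standardized prediction
    sampled at the discrete fixation locations."""
    z = (prediction - prediction.mean()) / (prediction.std() + 1e-8)
    values = [z[int(round(y)), int(round(x))]
              for x, y in fixations
              if 0 <= int(round(y)) < z.shape[0] and 0 <= int(round(x)) < z.shape[1]]
    return float(np.mean(values)) if values else 0.0

Frame-level maps built this way can serve as supervision targets or evaluation references when comparing saliency models on non-photorealistic data.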
Related papers
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding [59.020450264301026]
VideoLLaMA3 is a more advanced multimodal foundation model for image and video understanding.
VideoLLaMA3 has four training stages: Vision Adaptation, Vision-Language Alignment, Fine-tuning, and Video-centric Fine-tuning.
VideoLLaMA3 achieves compelling performances in both image and video understanding benchmarks.
arXiv Detail & Related papers (2025-01-22T18:59:46Z)
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation [70.68566282567207]
VisionReward is a fine-grained and multi-dimensional reward model.
We decompose human preferences in images and videos into multiple dimensions.
Based on VisionReward, we develop a multi-objective preference learning algorithm.
arXiv Detail & Related papers (2024-12-30T16:24:09Z)
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training with long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition [33.800842679024164]
We address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos.
Most existing works identify scenes in videos using only visual or textual information from a temporal perspective.
We propose a novel two-stream framework to model video representations from multiple perspectives.
arXiv Detail & Related papers (2024-01-09T04:37:10Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving [10.3540046389057]
This work presents a multi-task visual perception network on unrectified fisheye images.
It consists of six primary tasks necessary for an autonomous driving system.
We demonstrate that the jointly trained model performs better than the respective single task versions.
arXiv Detail & Related papers (2021-02-15T10:46:24Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)