Video Recognition in Portrait Mode
- URL: http://arxiv.org/abs/2312.13746v1
- Date: Thu, 21 Dec 2023 11:30:02 GMT
- Authors: Mingfei Han, Linjie Yang, Xiaojie Jin, Jiashi Feng, Xiaojun Chang,
Heng Wang
- Abstract summary: We develop the first dataset dedicated to portrait mode video recognition, namely PortraitMode-400.
We conduct a comprehensive analysis of the impact of video format (portrait mode versus landscape mode) on recognition accuracy and spatial bias due to the different formats.
We design experiments to explore key aspects of portrait mode video recognition, including the choice of data augmentation, evaluation procedure, the importance of temporal information, and the role of audio modality.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The creation of new datasets often presents new challenges for video
recognition and can inspire novel ideas while addressing these challenges.
While existing datasets mainly comprise landscape mode videos, our paper seeks
to introduce portrait mode videos to the research community and highlight the
unique challenges associated with this video format. With the growing
popularity of smartphones and social media applications, recognizing portrait
mode videos is becoming increasingly important. To this end, we have developed
the first dataset dedicated to portrait mode video recognition, namely
PortraitMode-400. The taxonomy of PortraitMode-400 was constructed in a
data-driven manner, comprising 400 fine-grained categories, and rigorous
quality assurance was implemented to ensure the accuracy of human annotations.
In addition to the new dataset, we conducted a comprehensive analysis of the
impact of video format (portrait mode versus landscape mode) on recognition
accuracy and spatial bias due to the different formats. Furthermore, we
designed extensive experiments to explore key aspects of portrait mode video
recognition, including the choice of data augmentation, evaluation procedure,
the importance of temporal information, and the role of audio modality.
Building on the insights from our experimental results and the introduction of
PortraitMode-400, our paper aims to inspire further research efforts in this
emerging research area.
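One of the questions the paper raises is how data augmentation should change for portrait mode video. As a minimal, hedged illustration (not the paper's actual pipeline), the sketch below shows an aspect-ratio-aware random crop that keeps a portrait-shaped (roughly 9:16) window, rather than the square or landscape crops common in landscape-mode recognition pipelines; the function name and default sizes are illustrative assumptions.

```python
import random

def portrait_random_crop(frame_w, frame_h, out_w=256, out_h=455, scale=(0.5, 1.0)):
    """Pick a random crop window that preserves a portrait output aspect
    ratio (out_w/out_h < 1) instead of forcing a square/landscape crop.
    Returns (x, y, w, h) of the crop window inside the frame.
    Illustrative sketch only -- not the augmentation used in the paper."""
    target_ar = out_w / out_h              # < 1 for portrait outputs
    s = random.uniform(*scale)             # fraction of frame area to keep
    # Choose crop height first, then derive width from the target ratio.
    crop_h = int(frame_h * s ** 0.5)
    crop_w = int(crop_h * target_ar)
    crop_w = min(crop_w, frame_w)
    crop_h = min(crop_h, frame_h)
    x = random.randint(0, frame_w - crop_w)
    y = random.randint(0, frame_h - crop_h)
    return x, y, crop_w, crop_h

random.seed(0)
x, y, w, h = portrait_random_crop(720, 1280)
print(x, y, w, h)  # a portrait-shaped crop window inside a 720x1280 frame
```

The same per-frame window would be applied to every frame of a clip so the temporal signal stays spatially consistent.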
Related papers
- PIV3CAMS: a multi-camera dataset for multiple computer vision problems and its application to novel view-point synthesis (arXiv, 2024-07-26)
  This thesis introduces Paired Image and Video data from three CAMeraS, namely PIV3CAMS. The PIV3CAMS dataset consists of 8385 pairs of images and 82 pairs of videos taken from three different cameras. In addition to reproducing a current state-of-the-art algorithm, the authors investigate several proposed alternative models that integrate depth information geometrically.
- BVI-RLV: A Fully Registered Dataset and Benchmarks for Low-Light Video Enhancement (arXiv, 2024-07-03)
  This paper introduces a low-light video dataset consisting of 40 scenes with various motion scenarios under two distinct low-lighting conditions. Fully registered ground-truth data is captured in normal light using a programmable motorized dolly and refined via an image-based approach for pixel-wise frame alignment across different light levels. Experimental results demonstrate the significance of fully registered video pairs for low-light video enhancement (LLVE), and models trained with this dataset outperform those trained with existing datasets.
- Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera (arXiv, 2024-05-30)
  This paper addresses daily challenges encountered by visually impaired individuals, such as limited access to information, navigation difficulties, and barriers to social interaction. To alleviate these challenges, the authors introduce a novel visual question answering dataset featuring videos captured with a 360-degree egocentric wearable camera, enabling observation of the entire surroundings.
- Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition (arXiv, 2024-01-09)
  This work addresses video scene recognition, whose goal is to learn a high-level video representation for classifying scenes in videos. Most existing works identify scenes only from visual or textual information in a temporal perspective; the authors propose a novel two-stream framework that models video representations from multiple perspectives.
- NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos (arXiv, 2023-08-23)
  NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations. The authors conduct a series of analyses to gain deeper insights into this task and propose NPSNet, a universal frequency-aware multi-modal non-photorealistic saliency detection model.
- EasyPortrait -- Face Parsing and Portrait Segmentation Dataset (arXiv, 2023-04-26)
  Video conferencing apps now rely on computer vision-based features such as real-time background removal and face beautification. EasyPortrait is a new dataset for both of these tasks, containing 40,000 primarily indoor photos of video-meeting scenarios with 13,705 unique users and fine-grained segmentation masks separated into 9 classes.
- Marine Video Kit: A New Marine Video Dataset for Content-based Analysis and Retrieval (arXiv, 2022-09-23)
  This paper focuses on single-shot videos taken from moving cameras in underwater environments. The first shard of a new Marine Video Kit is presented to serve video retrieval and other computer vision challenges.
- ViSeRet: A simple yet effective approach to moment retrieval via fine-grained video segmentation (arXiv, 2021-10-11)
  This paper presents the 1st-place solution to the video retrieval track of the ICCV VALUE Challenge 2021: a simple yet effective approach that jointly tackles two video-text retrieval tasks, with an ensemble model achieving new state-of-the-art performance on all four datasets.
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (arXiv, 2021-04-01)
  The authors propose an end-to-end trainable model designed to take advantage of both large-scale image and video captioning datasets. The model is flexible and can be trained on image and video-text datasets either independently or in conjunction, yielding state-of-the-art results on standard downstream video-retrieval benchmarks.
- A Comprehensive Study of Deep Video Action Recognition (arXiv, 2020-12-11)
  Video action recognition is one of the representative tasks for video understanding. This survey comprehensively reviews over 200 existing papers on deep learning for video action recognition.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.