Beyond the Camera: Neural Networks in World Coordinates
- URL: http://arxiv.org/abs/2003.05614v1
- Date: Thu, 12 Mar 2020 04:29:34 GMT
- Title: Beyond the Camera: Neural Networks in World Coordinates
- Authors: Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari
- Abstract summary: Eye movement gives animals increased resolution of the scene and suppresses distracting information.
We propose a simple idea, WorldFeatures, where each feature at every layer has a spatial transformation, and the feature map is only transformed as needed.
We show that a network built with these WorldFeatures can be used to model eye movements, such as saccades, fixation, and smooth pursuit, even in a batch setting on pre-recorded video.
- Score: 82.31045377469584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Eye movement and strategic placement of the visual field onto the retina
give animals increased resolution of the scene and suppress distracting
information. This fundamental system has been missing from video understanding
with deep networks, typically limited to 224 by 224 pixel content locked to the
camera frame. We propose a simple idea, WorldFeatures, where each feature at
every layer has a spatial transformation, and the feature map is only
transformed as needed. We show that a network built with these WorldFeatures
can be used to model eye movements, such as saccades, fixation, and smooth
pursuit, even in a batch setting on pre-recorded video. That is, the network
can for example use all 224 by 224 pixels to look at a small detail one moment,
and the whole scene the next. We show that typical building blocks, such as
convolutions and pooling, can be adapted to support WorldFeatures using
available tools. Experiments are presented on the Charades, Olympic Sports, and
Caltech-UCSD Birds-200-2011 datasets, exploring action recognition,
fine-grained recognition, and video stabilization.
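To make the mechanism above concrete, the following is a minimal PyTorch-style sketch of the idea, not the authors' code: each feature map carries a 2x3 affine into a shared world frame, ordinary layers pass that transform through untouched, and a map is only resampled when it has to be combined with another one. The names `WorldFeature`, `warp_into_frame`, and `world_conv`, and the use of normalized affine coordinates, are assumptions made for illustration.

```python
# Minimal sketch of the WorldFeatures idea (illustrative; not the paper's code):
# every feature map carries a 2x3 affine mapping its normalized coordinates to a
# shared "world" frame, and maps are only resampled when they must be combined.
from dataclasses import dataclass

import torch
import torch.nn.functional as F


@dataclass
class WorldFeature:            # hypothetical container
    tensor: torch.Tensor       # (N, C, H, W) feature map
    to_world: torch.Tensor     # (N, 2, 3) affine: feature coords -> world coords


def _full(a: torch.Tensor) -> torch.Tensor:
    """Promote batched 2x3 affines to 3x3 homogeneous matrices."""
    pad = a.new_tensor([0.0, 0.0, 1.0]).expand(a.size(0), 1, 3)
    return torch.cat([a, pad], dim=1)


def warp_into_frame(src: WorldFeature, dst: WorldFeature) -> torch.Tensor:
    """Resample src so it is spatially aligned with dst's coordinate frame."""
    # dst feature coords -> world -> src feature coords
    theta = (torch.linalg.inv(_full(src.to_world)) @ _full(dst.to_world))[:, :2, :]
    grid = F.affine_grid(theta, dst.tensor.shape, align_corners=False)
    return F.grid_sample(src.tensor, grid, align_corners=False)


# Ordinary layers need no change: they act on `tensor` and carry `to_world` along.
conv = torch.nn.Conv2d(8, 8, kernel_size=3, padding=1)

def world_conv(x: WorldFeature) -> WorldFeature:
    return WorldFeature(conv(x.tensor), x.to_world)


if __name__ == "__main__":
    n, c, h, w = 1, 8, 56, 56
    identity = torch.eye(2, 3).unsqueeze(0)                       # full-scene view
    zoom = torch.tensor([[[0.25, 0.0, 0.1], [0.0, 0.25, 0.1]]])   # "fixated" crop of the scene
    full_view = WorldFeature(torch.randn(n, c, h, w), identity)
    detail = WorldFeature(torch.randn(n, c, h, w), zoom)
    detail = world_conv(detail)
    # Combine the two views only after aligning them in the full view's frame.
    fused = full_view.tensor + warp_into_frame(detail, full_view)
    print(fused.shape)  # torch.Size([1, 8, 56, 56])
```

Pooling or strided layers would additionally have to update `to_world` to reflect the change in resolution; that bookkeeping, and the paper's actual handling of saccade-like transforms, is omitted here.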
Related papers
- Seeing Objects in a Cluttered World: Computational Objectness from Motion in Video [0.0]
Perceiving the visually disjoint surfaces of our world as whole objects, physically distinct from the surfaces that overlap them, forms the basis of our visual perception.
We present a simple but novel approach to infer objectness from phenomenology without object models.
We show that it delivers robust perception of individual attended objects in cluttered scenes, even with blur and camera shake.
arXiv Detail & Related papers (2024-02-02T03:57:11Z) - Panoptic Video Scene Graph Generation [110.82362282102288]
We propose and study a new problem called panoptic video scene graph generation (PVSG).
PVSG relates to the existing video scene graph generation problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos.
We contribute the PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs.
arXiv Detail & Related papers (2023-11-28T18:59:57Z) - Im4D: High-Fidelity and Real-Time Novel View Synthesis for Dynamic Scenes [69.52540205439989]
We introduce Im4D, a hybrid representation that consists of a grid-based geometry representation and a multi-view image-based appearance representation.
We represent the scene appearance by the original multi-view videos and a network that learns to predict the color of a 3D point from image features.
We show that Im4D achieves state-of-the-art rendering quality and can be trained efficiently, while realizing real-time rendering at 79.8 FPS for 512x512 images.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2022: Team HNU-FPV Technical Report [4.88605334919407]
We present our submission to the 2022 EPIC-Kitchens Unsupervised Domain Adaptation Challenge.
Our method ranks 4th among this year's teams on the test set of EPIC-KITCHENS-100.
arXiv Detail & Related papers (2022-07-07T05:27:32Z) - Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z) - HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video [44.58519508310171]
We introduce a free-viewpoint rendering method -- HumanNeRF -- that works on a given monocular video of a human performing complex body motions.
Our method enables pausing the video at any frame and rendering the subject from arbitrary new camera viewpoints.
arXiv Detail & Related papers (2022-01-11T18:51:21Z) - A Multi-viewpoint Outdoor Dataset for Human Action Recognition [3.522154868524807]
We present a multi-viewpoint outdoor action recognition dataset collected from YouTube and our own drone.
The dataset consists of 20 dynamic human action classes, 2,324 video clips and 503,086 frames.
The overall baseline action recognition accuracy is 74.0%.
arXiv Detail & Related papers (2021-10-07T14:50:43Z) - Learning Visual Affordance Grounding from Demonstration Videos [76.46484684007706]
Affordance grounding aims to segment all possible interaction regions between people and objects from an image/video.
We propose a Hand-aided Affordance Grounding Network (HAGNet) that leverages the aided clues provided by the position and action of the hand in demonstration videos.
arXiv Detail & Related papers (2021-08-12T11:45:38Z) - A Single Frame and Multi-Frame Joint Network for 360-degree Panorama Video Super-Resolution [34.35942412092329]
Spherical videos, also known as 360° (panorama) videos, can be viewed with various virtual reality devices such as computers and head-mounted displays.
We propose a novel single frame and multi-frame joint network (SMFN) for recovering high-resolution spherical videos from low-resolution inputs.
arXiv Detail & Related papers (2020-08-24T11:09:54Z) - Neural Sparse Voxel Fields [151.20366604586403]
We introduce Neural Sparse Voxel Fields (NSVF), a new neural scene representation for fast and high-quality free-viewpoint rendering.
NSVF defines a set of voxel-bounded implicit fields organized in a sparse voxel octree to model local properties in each cell.
Our method is typically over 10 times faster than the state-of-the-art (namely, NeRF (Mildenhall et al., 2020)) at inference time while achieving higher quality results.
arXiv Detail & Related papers (2020-07-22T17:51:31Z)
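The inference-time speed-up attributed to NSVF above comes from restricting network evaluations to occupied voxels. The sketch below illustrates that empty-space skipping under simplifying assumptions: a dense boolean occupancy grid stands in for the sparse voxel octree, and `sample_in_occupied_voxels` is an illustrative name rather than the authors' API.

```python
# Sketch of empty-space skipping with a voxel occupancy grid (illustrative only;
# NSVF itself uses a sparse voxel octree with per-voxel feature embeddings).
import torch


def sample_in_occupied_voxels(ray_pts: torch.Tensor,
                              occupancy: torch.Tensor,
                              voxel_size: float) -> torch.Tensor:
    """ray_pts: (P, 3) sample points in [0, 1)^3 along the rays.
    occupancy: (R, R, R) boolean grid marking non-empty voxels.
    Returns a (P,) boolean mask of points that need a network evaluation."""
    idx = (ray_pts / voxel_size).long().clamp_(0, occupancy.size(0) - 1)
    return occupancy[idx[:, 0], idx[:, 1], idx[:, 2]]


if __name__ == "__main__":
    res, voxel = 32, 1.0 / 32
    occ = torch.zeros(res, res, res, dtype=torch.bool)
    occ[10:20, 10:20, 10:20] = True           # pretend the scene occupies one block of voxels
    pts = torch.rand(4096, 3)                 # candidate samples along many rays
    mask = sample_in_occupied_voxels(pts, occ, voxel)
    # Only pts[mask] would be passed to the implicit field (MLP omitted here),
    # which is where the inference-time speed-up over dense sampling comes from.
    print(f"{mask.float().mean().item():.1%} of samples need evaluation")
```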