Unsupervised Action Localization Crop in Video Retargeting for 3D
ConvNets
- URL: http://arxiv.org/abs/2111.07426v1
- Date: Sun, 14 Nov 2021 19:27:13 GMT
- Title: Unsupervised Action Localization Crop in Video Retargeting for 3D
ConvNets
- Authors: Prithwish Jana, Swarnabja Bhaumik and Partha Pratim Mohanta
- Abstract summary: 3D CNNs require a square-shaped video whose spatial dimension is smaller than the original one. Random or center-cropping techniques in use may leave out the video's subject altogether.
We propose an unsupervised video cropping approach by shaping this as a retargeting and video-to-video synthesis problem.
The synthesized video maintains a 1:1 aspect ratio, is smaller in size, and is targeted at the video subject throughout the duration.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Untrimmed videos on social media or those captured by robots and surveillance
cameras are of varied aspect ratios. However, 3D CNNs require a square-shaped
video whose spatial dimension is smaller than the original one. Random or
center-cropping techniques in use may leave out the video's subject altogether.
To address this, we propose an unsupervised video cropping approach by shaping
this as a retargeting and video-to-video synthesis problem. The synthesized
video maintains a 1:1 aspect ratio, is smaller in size, and is targeted at the
video subject throughout the whole duration. First, action localization on the
individual frames is performed by identifying patches with homogeneous motion
patterns, and a single salient patch is pin-pointed. To avoid viewpoint jitters
and flickering artifacts, any inter-frame scale or position changes among the
patches are performed gradually over time. This is addressed with a
poly-Bézier fitting in 3D space that passes through some chosen pivot
timestamps and whose shape is influenced by in-between control timestamps. To
corroborate the effectiveness of the proposed method, we evaluate the video
classification task by comparing our dynamic cropping with static random
cropping on three benchmark datasets: UCF-101, HMDB-51 and ActivityNet v1.3.
After our cropping, the clip accuracy and top-1 accuracy for video
classification outperform 3D CNN performances on same-sized random-crop
inputs, sometimes even surpassing those of larger random crop sizes.
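To make the two method steps above concrete, here is a minimal sketch of the per-frame localization idea, assuming grayscale frames and a fixed patch grid; the scoring function (favoring strong, homogeneous motion via frame differencing) is an illustrative assumption, not the authors' exact criterion.

```python
import numpy as np

def salient_patch(prev_frame, frame, patch=32):
    """Pick the patch whose motion is strongest and most homogeneous.

    A simplified stand-in for the paper's per-frame action
    localization; frames are (H, W) grayscale arrays, and the return
    value is the (row, col) top-left corner of the selected patch.
    """
    diff = np.abs(frame.astype(float) - prev_frame.astype(float))
    h, w = diff.shape
    best, best_score = (0, 0), -np.inf
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = diff[y:y + patch, x:x + patch]
            # Favor strong motion (high mean) that is uniform across
            # the patch (low std), i.e. a homogeneous motion pattern.
            score = block.mean() - 0.5 * block.std()
            if score > best_score:
                best, best_score = (y, x), score
    return best
```

The trajectory-smoothing step can likewise be sketched with piecewise cubic Bézier segments, assuming pivots every few frames and control points taken from in-between samples (the paper's pivot and control-timestamp selection may differ):

```python
import numpy as np

def bezier_eval(ctrl, t):
    """Evaluate a cubic Bézier curve with control points ctrl (4, D)
    at parameters t (T,), using the Bernstein basis."""
    t = t[:, None]
    return ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])

def smooth_crop_trajectory(raw, pivot_every=16):
    """Smooth per-frame crop parameters (x, y, scale) with a
    poly-Bézier curve: pivots anchor each cubic segment, and two
    in-between samples act as control points shaping it.

    raw: (N, 3) noisy per-frame crop centers and scales.
    """
    n = len(raw)
    if n < 2:
        return raw.astype(float)
    pivots = list(range(0, n - 1, pivot_every)) + [n - 1]
    out = np.empty_like(raw, dtype=float)
    for a, b in zip(pivots[:-1], pivots[1:]):
        seg = raw[a:b + 1]
        c1 = seg[len(seg) // 3]          # control point at ~1/3
        c2 = seg[(2 * len(seg)) // 3]    # control point at ~2/3
        ctrl = np.stack([raw[a].astype(float), c1, c2, raw[b].astype(float)])
        t = np.linspace(0.0, 1.0, b - a + 1)
        out[a:b + 1] = bezier_eval(ctrl, t)
    return out
```

Cropping each frame at the smoothed (x, y) with a side length proportional to the smoothed scale then yields a square clip that tracks the subject; because adjacent segments share their pivot endpoints, the curve changes position and scale gradually rather than jumping.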
Related papers
- PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point
Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework.
Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion.
We animate deformable characters using real-world motion capture data, build 3D scenes to match the motion-capture environments, and render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z) - Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moiré patterns, appearing as color distortions, severely degrade image and video quality when filming a screen with digital cameras.
We study how to remove such undesirable moiré patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z) - Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred
Objects in Videos [115.71874459429381]
We propose a method for jointly estimating the 3D motion, 3D shape, and appearance of highly motion-blurred objects from a video.
Experiments on benchmark datasets demonstrate that our method outperforms previous methods for fast moving object deblurring and 3D reconstruction.
arXiv Detail & Related papers (2021-11-29T11:25:14Z) - Consistent Depth of Moving Objects in Video [52.72092264848864]
We present a method to estimate depth of a dynamic scene, containing arbitrary moving objects, from an ordinary video captured with a moving camera.
We formulate this objective in a new test-time training framework where a depth-prediction CNN is trained in tandem with an auxiliary scene-flow prediction over the entire input video.
We demonstrate accurate and temporally coherent results on a variety of challenging videos containing diverse moving objects (pets, people, cars) as well as camera motion.
arXiv Detail & Related papers (2021-08-02T20:53:18Z) - Self-supervised Video Representation Learning by Uncovering
Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
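As a rough illustration of such summaries (the paper's exact targets may differ), the sketch below derives two labels from a pair of consecutive frames: the grid cell containing the largest motion and a quantized dominant motion direction, using OpenCV's Farneback optical flow.

```python
import cv2
import numpy as np

def motion_statistics(prev_gray, next_gray, grid=4, angle_bins=8):
    """Pretext targets from two consecutive 8-bit grayscale frames:
    the grid cell with the largest average motion, and the dominant
    motion direction within that cell, quantized into angle_bins."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    mag = np.linalg.norm(flow, axis=2)
    # Average motion magnitude per grid cell (crop to a multiple of grid).
    cell_mag = mag[:h // grid * grid, :w // grid * grid] \
        .reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    loc = int(np.argmax(cell_mag))                 # cell index
    gy, gx = divmod(loc, grid)
    cell = flow[gy * (h // grid):(gy + 1) * (h // grid),
                gx * (w // grid):(gx + 1) * (w // grid)]
    mean_flow = cell.reshape(-1, 2).mean(axis=0)
    angle = np.arctan2(mean_flow[1], mean_flow[0]) % (2 * np.pi)
    direction = int(angle / (2 * np.pi / angle_bins))  # quantized bin
    return loc, direction
```

A network predicting (loc, direction) from the raw frames is forced to attend to where and how things move, which is the spirit of the pretext task.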
arXiv Detail & Related papers (2020-08-31T08:31:56Z) - Spatiotemporal Bundle Adjustment for Dynamic 3D Human Reconstruction in
the Wild [49.672487902268706]
We present a framework that jointly estimates camera temporal alignment and 3D point triangulation.
We reconstruct 3D motion trajectories of human bodies in events captured by multiple uncalibrated and unsynchronized video cameras.
arXiv Detail & Related papers (2020-07-24T23:50:46Z) - Across Scales & Across Dimensions: Temporal Super-Resolution using Deep
Internal Learning [11.658606722158517]
We train a video-specific CNN on examples extracted directly from the low-framerate input video.
Our method exploits the strong recurrence of small space-time patches inside a single video sequence.
The higher spatial resolution of video frames provides strong examples of how to increase the temporal resolution of that video.
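A minimal sketch of the internal-learning idea, assuming training pairs are built by 2x temporal subsampling of space-time patches drawn from the input video itself (the paper's actual example extraction may differ):

```python
import numpy as np

def internal_training_pairs(video, n_pairs=64, patch=32, t_len=8, rng=None):
    """Build (low-framerate, high-framerate) training pairs from a
    single video via temporal subsampling of random space-time patches.

    video: (T, H, W, C) array. Each target patch spans t_len frames;
    the input keeps every other frame, so a video-specific CNN can
    learn 2x temporal upsampling from the clip's internal recurrence.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T, H, W, _ = video.shape
    pairs = []
    for _ in range(n_pairs):
        t0 = int(rng.integers(0, T - t_len + 1))
        y0 = int(rng.integers(0, H - patch + 1))
        x0 = int(rng.integers(0, W - patch + 1))
        hi = video[t0:t0 + t_len, y0:y0 + patch, x0:x0 + patch]
        lo = hi[::2]  # 2x temporal subsampling of the same patch
        pairs.append((lo, hi))
    return pairs
```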
arXiv Detail & Related papers (2020-03-19T15:53:01Z) - An Information-rich Sampling Technique over Spatio-Temporal CNN for
Classification of Human Actions in Videos [5.414308305392762]
We propose a novel scheme for human action recognition in videos, using a 3-dimensional Convolutional Neural Network (3D CNN) based classifier.
In this paper, a 3D CNN architecture is proposed to extract features, followed by a Long Short-Term Memory (LSTM) network to recognize human actions.
Experiments are performed on the KTH and WEIZMANN human action datasets, where the method is shown to produce results comparable with state-of-the-art techniques.
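A minimal PyTorch sketch of a 3D-CNN-followed-by-LSTM classifier in this spirit; the layer sizes and pooling choices are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Conv3DLSTM(nn.Module):
    """3D convolutions extract short-range spatio-temporal features;
    an LSTM then aggregates them over time for action classification."""

    def __init__(self, n_classes, hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),             # pool space, keep time
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse space only
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        f = self.features(clip)              # (B, 64, T, 1, 1)
        f = f.flatten(2).transpose(1, 2)     # (B, T, 64)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])         # logits from the last step

# Example: Conv3DLSTM(n_classes=6)(torch.randn(2, 3, 16, 64, 64)) -> (2, 6)
```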
arXiv Detail & Related papers (2020-02-06T05:07:41Z) - Synergetic Reconstruction from 2D Pose and 3D Motion for Wide-Space
Multi-Person Video Motion Capture in the Wild [3.0015034534260665]
We propose a markerless motion capture method with accuracy and smoothness from multiple cameras.
The proposed method predicts each person's 3D pose and determines the bounding box in multi-camera images.
We evaluated the proposed method using various datasets and a real sports field.
arXiv Detail & Related papers (2020-01-16T02:14:59Z)