Cross-View Exocentric to Egocentric Video Synthesis
- URL: http://arxiv.org/abs/2107.03120v1
- Date: Wed, 7 Jul 2021 10:00:52 GMT
- Title: Cross-View Exocentric to Egocentric Video Synthesis
- Authors: Gaowen Liu, Hao Tang, Hugo Latapie, Jason Corso, Yan Yan
- Abstract summary: The cross-view video synthesis task seeks to generate video sequences of one view from another, dramatically different view.
We propose a novel Bi-directional Spatial Temporal Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and temporal information.
The proposed STA-GAN consists of three parts: temporal branch, spatial branch, and attention fusion.
- Score: 18.575642755375107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The cross-view video synthesis task seeks to generate video sequences of one
view from another, dramatically different view. In this paper, we investigate the
exocentric (third-person) view to egocentric (first-person) view video
generation task. This is challenging because the egocentric view is sometimes
remarkably different from the exocentric view. Thus, transforming the
appearances across the two different views is a non-trivial task. In particular,
we propose a novel Bi-directional Spatial Temporal Attention Fusion Generative
Adversarial Network (STA-GAN) to learn both spatial and temporal information to
generate egocentric video sequences from the exocentric view. The proposed
STA-GAN consists of three parts: temporal branch, spatial branch, and attention
fusion. First, the temporal and spatial branches generate a sequence of fake
frames and their corresponding features. The fake frames are generated in both
downstream and upstream directions by both the temporal and spatial branches.
Next, the four generated fake frames and their corresponding features (from the
spatial and temporal branches in the two directions) are fed into a novel
multi-generation attention fusion module to produce the final video sequence.
Meanwhile, we also propose a novel temporal and spatial dual-discriminator for
more robust network optimization. Extensive experiments on the Side2Ego and
Top2Ego datasets show that the proposed STA-GAN significantly outperforms the
existing methods.
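The abstract only outlines the architecture, so below is a minimal, hypothetical PyTorch-style sketch of the multi-branch generation and attention fusion idea it describes: four branch generators (spatial/temporal, in downstream/upstream directions) each produce fake frames plus per-frame features, and a per-pixel softmax attention map weights the four candidates into the final egocentric sequence. The class names, layer sizes, and wiring are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BranchGenerator(nn.Module):
    """Toy stand-in for one generation branch (spatial or temporal, one direction).

    Maps an exocentric frame sequence to fake egocentric frames plus a feature
    map per frame. Layer sizes are illustrative, not taken from the paper.
    """
    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        self.encode = nn.Conv2d(in_ch, feat_ch, 3, padding=1)
        self.decode = nn.Conv2d(feat_ch, in_ch, 3, padding=1)

    def forward(self, frames):                      # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        x = frames.reshape(b * t, c, h, w)
        feat = torch.relu(self.encode(x))           # per-frame features
        fake = torch.tanh(self.decode(feat))        # per-frame fake egocentric frames
        return (fake.reshape(b, t, c, h, w),
                feat.reshape(b, t, -1, h, w))

class AttentionFusion(nn.Module):
    """Fuse the four candidate generations with per-pixel softmax attention."""
    def __init__(self, feat_ch=32):
        super().__init__()
        self.to_weight = nn.Conv2d(feat_ch, 1, 1)   # one attention logit map per branch

    def forward(self, fakes, feats):
        # fakes / feats: lists of 4 tensors, (B, T, C, H, W) and (B, T, F, H, W)
        b, t, c, h, w = fakes[0].shape
        logits = torch.stack(
            [self.to_weight(f.reshape(b * t, -1, h, w)) for f in feats], dim=1
        )                                           # (B*T, 4, 1, H, W)
        attn = torch.softmax(logits, dim=1)         # normalize across the 4 branches
        stacked = torch.stack([f.reshape(b * t, c, h, w) for f in fakes], dim=1)
        fused = (attn * stacked).sum(dim=1)         # attention-weighted sum
        return fused.reshape(b, t, c, h, w)

# Hypothetical wiring: spatial/temporal x downstream/upstream = 4 branches.
branches = [BranchGenerator() for _ in range(4)]
fusion = AttentionFusion()

exo = torch.randn(2, 5, 3, 64, 64)                  # (batch, frames, C, H, W)
outs = [g(exo) for g in branches]
ego_fake = fusion([o[0] for o in outs], [o[1] for o in outs])
print(ego_fake.shape)                               # torch.Size([2, 5, 3, 64, 64])
```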
Related papers
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos [66.46812056962567]
Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
We propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation and pixel-level hallucination.
arXiv Detail & Related papers (2024-03-11T01:00:00Z)
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, leading to increased complexity in text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-point estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)