Less is More: Sparse Sampling for Dense Reaction Predictions
- URL: http://arxiv.org/abs/2106.01764v1
- Date: Thu, 3 Jun 2021 11:33:59 GMT
- Title: Less is More: Sparse Sampling for Dense Reaction Predictions
- Authors: Kezhou Lin, Xiaohan Wang, Zhedong Zheng, Linchao Zhu, and Yi Yang
- Abstract summary: We present our method for the 2021 Evoked Expression from Videos Challenge.
Our model utilizes both audio and image modalities as inputs to predict viewers' emotion changes.
The proposed method achieved a Pearson's correlation score of 0.04430 on the final private test set.
- Score: 60.005266111509435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Obtaining viewer responses from videos can be useful for creators and
streaming platforms to analyze video performance and improve the future
user experience. In this report, we present our method for the 2021 Evoked
Expression from Videos Challenge. In particular, our model utilizes both audio
and image modalities as inputs to predict viewers' emotion changes. To model
long-range emotion changes, we use a GRU-based model to predict a sparse
signal at 1 Hz. We observe that the emotion changes are smooth, so the final
dense prediction is obtained by linearly interpolating the sparse signal, which
is robust to prediction fluctuation. Albeit simple, the proposed method
achieved a Pearson's correlation score of 0.04430 on the final private test set.
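The sparse-to-dense step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the 6 Hz target rate, the toy per-second values, and the helper name are assumptions for demonstration; the paper's sparse predictions come from a GRU-based model.

```python
import numpy as np

def dense_from_sparse(sparse_preds, sparse_hz=1.0, dense_hz=6.0):
    """Linearly interpolate a sparse (e.g. 1 Hz) emotion signal to a
    dense one, as in the sparse-prediction-then-interpolation scheme.

    sparse_preds: 1-D array of per-second model outputs.
    Returns an array sampled at dense_hz over the same time span.
    """
    t_sparse = np.arange(len(sparse_preds)) / sparse_hz
    # Number of dense samples spanning the same interval, endpoints included.
    n_dense = int((len(sparse_preds) - 1) * dense_hz / sparse_hz) + 1
    t_dense = np.linspace(0.0, t_sparse[-1], n_dense)
    # Linear interpolation smooths over per-step prediction fluctuation.
    return np.interp(t_dense, t_sparse, sparse_preds)

# Toy 1 Hz predictions for a 4-second clip (illustrative values):
sparse = np.array([0.1, 0.3, 0.2, 0.4])
dense = dense_from_sparse(sparse)
```

Because interpolation only passes through the sparse anchor points, a single noisy dense-rate prediction cannot perturb the output, which is the robustness the abstract refers to.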
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Motion and Context-Aware Audio-Visual Conditioned Video Prediction [58.9467115916639]
We decouple the audio-visual conditioned video prediction into motion and appearance modeling.
The multimodal motion estimation predicts future optical flow based on the audio-motion correlation.
We propose context-aware refinement to address the diminishing of the global appearance context.
arXiv Detail & Related papers (2022-12-09T05:57:46Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Fourier-based Video Prediction through Relational Object Motion [28.502280038100167]
Deep recurrent architectures have been applied to the task of video prediction.
Here, we explore a different approach by using frequency-domain approaches for video prediction.
The resulting predictions are consistent with the observed dynamics in a scene and do not suffer from blur.
arXiv Detail & Related papers (2021-10-12T10:43:05Z)
- Semantic Prediction: Which One Should Come First, Recognition or Prediction? [21.466783934830925]
One of the primary downstream tasks is interpreting the scene's semantic composition and using it for decision-making.
Given a pre-trained video prediction model and a pre-trained semantic extraction model, there are two main ways to combine them: predict future frames first and then extract their semantics, or extract semantics first and then predict their future.
We investigate these configurations using the Local Frequency Domain Transformer Network (LFDTN) as the video prediction model and U-Net as the semantic extraction model on synthetic and real datasets.
arXiv Detail & Related papers (2021-10-06T15:01:05Z)
- Novel View Video Prediction Using a Dual Representation [51.58657840049716]
Given a set of input video clips from a single/multiple views, our network is able to predict the video from a novel view.
The proposed approach does not require any priors and is able to predict the video from wider angular distances, up to 45 degrees.
A comparison with state-of-the-art novel view video prediction methods shows an improvement of 26.1% in SSIM, 13.6% in PSNR, and 60% in FVD scores without using explicit priors from target views.
arXiv Detail & Related papers (2021-06-07T20:41:33Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
- Motion Prediction Using Temporal Inception Module [96.76721173517895]
We propose a Temporal Inception Module (TIM) to encode human motion.
Our framework produces input embeddings with convolutional layers, using different kernel sizes for different input lengths.
The experimental results on standard motion prediction benchmark datasets Human3.6M and CMU motion capture dataset show that our approach consistently outperforms the state of the art methods.
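The multi-kernel-size idea behind the Temporal Inception Module can be sketched as follows. This is a toy illustration only: the kernel sizes, the fixed averaging kernels, and the mean pooling are assumptions for demonstration; the actual TIM learns its convolutional filters and operates on human motion sequences.

```python
import numpy as np

def multi_scale_embedding(motion, kernel_sizes=(3, 5, 7)):
    """Toy multi-scale temporal embedding: each kernel size sees the
    1-D motion signal at a different temporal extent, and the pooled
    per-scale responses are concatenated into one embedding vector."""
    feats = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k                    # averaging stand-in for a learned filter
        response = np.convolve(motion, kernel, mode="valid")
        feats.append(response.mean())              # pool each scale to a scalar
    return np.array(feats)

# A simple ramp signal as a stand-in for a joint-angle trajectory:
emb = multi_scale_embedding(np.arange(10, dtype=float))
```

Concatenating responses from several kernel sizes lets short and long temporal contexts contribute to the same embedding, which is the inception-style intuition the summary describes.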
arXiv Detail & Related papers (2020-10-06T20:26:01Z)
- Motion Segmentation using Frequency Domain Transformer Networks [29.998917158604694]
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
Our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
arXiv Detail & Related papers (2020-04-18T15:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.