Taylor saves for later: disentanglement for video prediction using Taylor representation
- URL: http://arxiv.org/abs/2105.11062v1
- Date: Mon, 24 May 2021 01:59:21 GMT
- Title: Taylor saves for later: disentanglement for video prediction using Taylor representation
- Authors: Ting Pan, Zhuqing Jiang, Jianan Han, Shiping Wen, Aidong Men and Haiying Wang
- Abstract summary: We propose a two-branch seq-to-seq deep model to disentangle the Taylor feature and the residual feature in video frames.
TaylorCell can expand the video frames' high-dimensional features into a finite Taylor series to describe the latent laws.
MCU distills all past frames' information to correct the predicted Taylor feature from TPU.
- Score: 5.658571172210811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction is a challenging task with wide application prospects in meteorology and robot systems. Existing works fail to balance short-term and long-term prediction performance and to extract robust latent dynamics laws from video frames. We propose a two-branch seq-to-seq deep model that disentangles the Taylor feature and the residual feature in video frames via a novel recurrent prediction module (TaylorCell) and a residual module. TaylorCell expands the video frames' high-dimensional features into a finite Taylor series to describe the latent laws. Within TaylorCell, we propose the Taylor prediction unit (TPU) and the memory correction unit (MCU). TPU employs the first input frame's derivative information to predict the future frames, avoiding error accumulation. MCU distills the information of all past frames to correct the Taylor feature predicted by TPU. Correspondingly, the residual module extracts the residual feature complementary to the Taylor feature. On three generalist datasets (Moving MNIST, TaxiBJ, Human3.6M), our model matches or outperforms state-of-the-art models, and ablation experiments demonstrate the effectiveness of our model in long-term prediction.
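To make the TPU/MCU mechanism concrete, here is a minimal sketch, assuming hypothetical module names, feature shapes, linear derivative heads, and an additive correction; the paper's actual architecture may differ. A TPU-style unit evaluates a finite Taylor series from derivative estimates of the first frame's feature, and an MCU-style unit corrects the prediction with a recurrent summary of all past frames.

```python
import math
import torch
import torch.nn as nn

class TaylorPredictionUnit(nn.Module):
    """Hypothetical TPU: estimate K derivative features of the first frame,
    then evaluate the finite Taylor series f(t) ~ sum_k f^(k)(0) t^k / k!."""
    def __init__(self, dim: int, order: int = 3):
        super().__init__()
        self.order = order
        # One linear head per derivative order (an illustrative choice).
        self.heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(order + 1))

    def forward(self, f0: torch.Tensor, t: float) -> torch.Tensor:
        # f0: (batch, dim) feature of the first input frame.
        return sum(self.heads[k](f0) * (t ** k) / math.factorial(k)
                   for k in range(self.order + 1))

class MemoryCorrectionUnit(nn.Module):
    """Hypothetical MCU: distill past-frame features with a GRU and emit an
    additive correction to the Taylor prediction."""
    def __init__(self, dim: int):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, past: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
        # past: (seq, batch, dim) features of all observed frames.
        h = torch.zeros(past.size(1), past.size(2))
        for step in past:                  # summarize the history step by step
            h = self.gru(step, h)
        return pred + self.out(h)          # additive correction (an assumption)

# Usage on dummy features: extrapolate 5 steps ahead, then correct.
tpu, mcu = TaylorPredictionUnit(dim=64), MemoryCorrectionUnit(dim=64)
past = torch.randn(10, 2, 64)              # 10 observed frames, batch of 2
prediction = mcu(past, tpu(past[0], t=5.0))
print(prediction.shape)                    # torch.Size([2, 64])
```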
Related papers
- Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor [10.899451333703437]
Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. TaylorSeer instead uses cached features to predict future ones via Taylor expansion; a minimal sketch of this extrapolation follows this entry. We propose a novel approach to better leverage Taylor-based acceleration.
arXiv Detail & Related papers (2025-08-04T09:39:31Z)
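As a rough illustration of the Taylor-based feature forecasting mentioned above, the sketch below extrapolates the next feature from a cache of past ones, using finite differences as derivative estimates; the cache layout, unit step, and expansion order are assumptions, not the paper's implementation.

```python
import math
import numpy as np

def taylor_forecast(cache: list[np.ndarray], order: int = 2) -> np.ndarray:
    """Predict the next feature from the most recent cached ones.

    cache: features at consecutive timesteps, oldest first (unit step assumed).
    """
    assert len(cache) >= order + 1
    recent = cache[-(order + 1):]
    pred = np.zeros_like(recent[-1])
    for k in range(order + 1):
        # k-th finite difference of the newest features ~ k-th derivative.
        diff = list(recent)
        for _ in range(k):
            diff = [b - a for a, b in zip(diff[:-1], diff[1:])]
        pred += diff[-1] / math.factorial(k)  # evaluate the series at t = +1
    return pred

# Usage: features following a quadratic trend over timesteps t = 0..3.
feats = [np.array([t ** 2, 2.0 * t]) for t in range(4)]
print(taylor_forecast(feats, order=2))  # -> [15., 8.], close to the true
                                        # t = 4 value [16., 8.]
```

For unit steps, a Newton-style extrapolation that drops the factorials would be exact on polynomial trends; the factorial form above keeps the Taylor-series reading of the cached differences.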
- MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by Taylor Formula for Image Restoration [85.41380152286479]
The proposed model, the second version of the Taylor formula expansion-based Transformer (MB-TaylorFormer V2 for short), can concurrently process coarse-to-fine features.
Experimental results across diverse image restoration benchmarks demonstrate that MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image restoration tasks.
arXiv Detail & Related papers (2025-01-08T13:13:52Z)
- Video Prediction Transformers without Recurrence or Convolution [65.93130697098658]
We propose PredFormer, a framework entirely based on Gated Transformers.
We provide a comprehensive analysis of 3D Attention in the context of video prediction.
The significant improvements in both accuracy and efficiency highlight the potential of PredFormer.
arXiv Detail & Related papers (2024-10-07T03:52:06Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Taylor Videos for Action Recognition [15.728388101131056]
A Taylor video is a new video format that highlights the dominant motions in each of its frames, which are named Taylor frames.
Taylor videos are named after the Taylor series, which approximates a function at a given point using its important terms.
We show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers.
arXiv Detail & Related papers (2024-02-05T14:00:13Z)
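A toy version of the Taylor-frame construction described above, assuming temporal finite differences stand in for derivatives and that the series is truncated at a small order; the paper's exact construction may differ.

```python
import math
import numpy as np

def taylor_frame(clip: np.ndarray, order: int = 2) -> np.ndarray:
    """clip: (T, H, W) grayscale frames; returns one Taylor frame."""
    terms = []
    diff = clip.astype(np.float64)
    for k in range(order + 1):
        terms.append(diff[-1] / math.factorial(k))  # k-th difference term
        diff = np.diff(diff, axis=0)                # next temporal difference
    return sum(terms)

# Usage: a bright dot moving right. Static content only enters the zeroth
# term; all difference terms respond to motion, so the moving region
# dominates the Taylor frame.
clip = np.zeros((4, 5, 5))
for t in range(4):
    clip[t, 2, t] = 1.0              # dot at column t of row 2
print(taylor_frame(clip, order=2)[2])
```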
- MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing [88.61523825903998]
Transformer networks are beginning to replace pure convolutional neural networks (CNNs) in the field of computer vision.
We propose a new Transformer variant, which applies the Taylor expansion to approximate the softmax attention and achieves linear computational complexity.
We introduce a multi-branch architecture with multi-scale patch embedding to the proposed Transformer, which embeds features by overlapping deformable convolutions of different scales.
Our model, named the Multi-branch Transformer expanded by Taylor formula (MB-TaylorFormer), can embed coarse-to-fine features more flexibly at the patch embedding stage and capture long-distance pixel interactions at limited computational cost.
arXiv Detail & Related papers (2023-08-27T08:10:23Z)
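The linear-complexity trick mentioned in the MB-TaylorFormer entry above can be illustrated with a first-order Taylor expansion of the softmax kernel, exp(q·k) ≈ 1 + q·k, which lets key-value products be pre-aggregated; the normalization choices and single-head form below are assumptions, not the paper's exact formulation.

```python
import torch

def taylor_linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (batch, n, d). Returns (batch, n, d) in O(n) time/memory."""
    q = torch.nn.functional.normalize(q, dim=-1)   # keeps 1 + q.k >= 0
    k = torch.nn.functional.normalize(k, dim=-1)
    kv = torch.einsum("bnd,bne->bde", k, v)        # sum_j k_j v_j^T
    k_sum = k.sum(dim=1)                           # sum_j k_j
    v_sum = v.sum(dim=1)                           # sum_j v_j
    # numerator_i = sum_j (1 + q_i.k_j) v_j = v_sum + q_i @ kv
    num = v_sum.unsqueeze(1) + torch.einsum("bnd,bde->bne", q, kv)
    # denominator_i = sum_j (1 + q_i.k_j) = n + q_i.k_sum
    den = k.shape[1] + torch.einsum("bnd,bd->bn", q, k_sum)
    return num / (den.unsqueeze(-1) + eps)

# Sanity check against the quadratic form of the same approximation.
q, k, v = torch.randn(2, 7, 4), torch.randn(2, 7, 4), torch.randn(2, 7, 4)
qn = torch.nn.functional.normalize(q, dim=-1)
kn = torch.nn.functional.normalize(k, dim=-1)
w = 1.0 + qn @ kn.transpose(1, 2)                  # (batch, n, n) weights
ref = (w / w.sum(-1, keepdim=True)) @ v
print(torch.allclose(taylor_linear_attention(q, k, v), ref, atol=1e-5))
```

The key design point is associativity: computing k_j v_j^T sums once and reusing them for every query avoids ever materializing the n-by-n attention matrix.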
- Taylorformer: Probabilistic Modelling for Random Processes including Time Series [0.0]
We propose the Taylorformer for random processes such as time series.
Its two key components are: 1) the LocalTaylor wrapper, which adapts Taylor approximations for use in neural-network-based probabilistic models, and 2) the MHA-X attention block, which makes predictions in a way inspired by how Gaussian processes' mean predictions are linear smoothings of contextual data.
arXiv Detail & Related papers (2023-05-30T15:50:24Z)
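The "linear smoothing" view behind MHA-X, noted in the Taylorformer entry above, can be sketched as follows: as with a Gaussian process posterior mean, the prediction is a weighted sum of context targets, with weights computed from the inputs alone. The single-head form and Gaussian weighting below are illustrative assumptions, not the paper's block.

```python
import torch

def smoothing_attention(x_ctx, y_ctx, x_tgt, scale: float = 1.0):
    """x_ctx: (n, dx), y_ctx: (n, dy), x_tgt: (m, dx) -> (m, dy)."""
    # Attention logits depend only on inputs, never on the targets y.
    logits = -torch.cdist(x_tgt, x_ctx) ** 2 / (2 * scale ** 2)
    weights = torch.softmax(logits, dim=-1)   # rows sum to 1
    return weights @ y_ctx                    # linear smoothing of context y

# Usage: smooth noisy samples of sin(x) onto a small target grid.
x_ctx = torch.linspace(0, 6, 30).unsqueeze(-1)
y_ctx = torch.sin(x_ctx) + 0.1 * torch.randn_like(x_ctx)
x_tgt = torch.linspace(0, 6, 5).unsqueeze(-1)
print(smoothing_attention(x_ctx, y_ctx, x_tgt, scale=0.5).squeeze(-1))
```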
- CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting [50.23240107430597]
We design a special Transformer, i.e., the Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of channel-independent (CI) Transformers in time series forecasting.
First, CARD introduces a channel-aligned attention structure that allows it to capture temporal correlations among signals.
Second, in order to efficiently utilize multi-scale knowledge, we design a token blend module to generate tokens with different resolutions.
Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue.
arXiv Detail & Related papers (2023-05-20T05:16:31Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- VMFormer: End-to-End Video Matting with Transformer [48.97730965527976]
Video matting aims to predict alpha mattes for each frame from a given input video sequence.
Recent solutions to video matting have been dominated by deep convolutional neural networks (CNNs).
We propose VMFormer: a transformer-based end-to-end method for video matting.
arXiv Detail & Related papers (2022-08-26T17:51:02Z)
- VPTR: Efficient Transformers for Video Prediction [14.685237010856953]
We propose a new Transformer block for video future frames prediction based on an efficient local spatial-temporal separation attention mechanism.
Based on this new Transformer block, a fully autoregressive video future frames prediction Transformer is proposed.
A non-autoregressive video prediction Transformer is also proposed to increase the inference speed and reduce the accumulated inference errors of its autoregressive counterpart.
arXiv Detail & Related papers (2022-03-29T18:09:09Z)
- Transforming Model Prediction for Tracking [109.08417327309937]
Transformers capture global relations with little inductive bias, allowing them to learn the prediction of more powerful target models.
We train the proposed tracker end-to-end and validate its performance by conducting comprehensive experiments on multiple tracking datasets.
Our tracker sets a new state of the art on three benchmarks, achieving an AUC of 68.5% on the challenging LaSOT dataset.
arXiv Detail & Related papers (2022-03-21T17:59:40Z)
- Taylor Swift: Taylor Driven Temporal Modeling for Swift Future Frame Prediction [22.57791389884491]
We introduce TayloSwiftNet, a novel convolutional neural network that learns to estimate the higher-order terms of the Taylor series for a given input video.
TayloSwiftNet can swiftly predict any desired future frame in just one forward pass and change the temporal resolution on-the-fly.
arXiv Detail & Related papers (2021-10-27T12:46:17Z)
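A minimal sketch of the one-forward-pass idea in the TayloSwiftNet entry above: once a network has produced the first few Taylor coefficient maps, any future frame, including fractional timesteps, is a cheap series evaluation. The coefficients are mocked here, and all names and shapes are illustrative assumptions.

```python
import math
import numpy as np

def evaluate_taylor(coeffs: list[np.ndarray], t: float) -> np.ndarray:
    """coeffs[k] ~ k-th derivative map at t=0; returns the frame at time t."""
    return sum(c * (t ** k) / math.factorial(k) for k, c in enumerate(coeffs))

# Mock coefficients for a 2x2 "frame": appearance, velocity, acceleration.
coeffs = [np.full((2, 2), 1.0), np.full((2, 2), 0.5), np.full((2, 2), -0.2)]
print(evaluate_taylor(coeffs, t=2.0))   # a full two-step-ahead frame
print(evaluate_taylor(coeffs, t=0.5))   # on-the-fly half-step resolution
```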
- A Log-likelihood Regularized KL Divergence for Video Prediction with a 3D Convolutional Variational Recurrent Network [17.91970304953206]
We introduce a new variational model that extends the recurrent network in two ways for the task of frame prediction.
First, we introduce 3D convolutions inside all modules, including the recurrent model, for future frame prediction, taking a sequence as input and outputting video frames at each timestep.
Second, we enhance the latent loss of the variational model by introducing a maximum likelihood estimate in addition to the KL divergence that is commonly used in variational models.
arXiv Detail & Related papers (2020-12-11T05:05:31Z)
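A hedged sketch of the loss structure described in the entry above: the usual KL term between posterior and prior, plus a maximum-likelihood term on the sampled latent. The weighting and parameterization are assumptions, not the paper's exact objective.

```python
import torch
import torch.distributions as D

def regularized_latent_loss(mu_q, logvar_q, mu_p, logvar_p, lam: float = 0.1):
    """Per-timestep latent loss: KL(q || p) - lam * E_q[log p(z)]."""
    q = D.Normal(mu_q, torch.exp(0.5 * logvar_q))   # posterior
    p = D.Normal(mu_p, torch.exp(0.5 * logvar_p))   # learned prior
    kl = D.kl_divergence(q, p).sum(-1)
    z = q.rsample()                          # reparameterized latent sample
    loglik = p.log_prob(z).sum(-1)           # likelihood of z under the prior
    return (kl - lam * loglik).mean()

# Usage with dummy posterior/prior parameters: batch of 4, latent dim 8.
mu_q, logvar_q = torch.randn(4, 8), torch.randn(4, 8)
mu_p, logvar_p = torch.zeros(4, 8), torch.zeros(4, 8)
print(regularized_latent_loss(mu_q, logvar_q, mu_p, logvar_p))
```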
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.