FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical
Flow Estimation
- URL: http://arxiv.org/abs/2303.01237v1
- Date: Thu, 2 Mar 2023 13:28:07 GMT
- Title: FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical
Flow Estimation
- Authors: Xiaoyu Shi, Zhaoyang Huang, Dasong Li, Manyuan Zhang, Ka Chun Cheung,
Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li
- Abstract summary: FlowFormer introduces a transformer architecture into optical flow estimation and achieves state-of-the-art performance.
We propose Masked Cost Volume Autoencoding (MCVA) to enhance FlowFormer by pretraining the cost-volume encoder with a novel MAE scheme.
FlowFormer++ ranks 1st among published methods on both Sintel and KITTI-2015 benchmarks.
- Score: 35.0926239683689
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: FlowFormer introduces a transformer architecture into optical flow estimation
and achieves state-of-the-art performance. The core component of FlowFormer is
the transformer-based cost-volume encoder. Inspired by the recent success of
masked autoencoding (MAE) pretraining in unleashing transformers' capacity of
encoding visual representation, we propose Masked Cost Volume Autoencoding
(MCVA) to enhance FlowFormer by pretraining the cost-volume encoder with a
novel MAE scheme. Firstly, we introduce a block-sharing masking strategy to
prevent masked information leakage, as the cost maps of neighboring source
pixels are highly correlated. Secondly, we propose a novel pre-text
reconstruction task, which encourages the cost-volume encoder to aggregate
long-range information and ensures pretraining-finetuning consistency. We also
show how to modify the FlowFormer architecture to accommodate masks during
pretraining. Pretrained with MCVA, FlowFormer++ ranks 1st among published
methods on both Sintel and KITTI-2015 benchmarks. Specifically, FlowFormer++
achieves 1.07 and 1.94 average end-point error (AEPE) on the clean and final
pass of the Sintel benchmark, leading to 7.76% and 7.18% error reductions from
FlowFormer. FlowFormer++ obtains 4.52 F1-all on the KITTI-2015 test set,
improving FlowFormer by 0.16.
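To make the block-sharing masking concrete, below is a minimal PyTorch sketch under stated assumptions: square source-pixel blocks, a Bernoulli mask over the target plane, and zero-filling of masked entries. The function name, defaults, and fill value are illustrative, not the paper's exact implementation.

```python
import torch

def block_sharing_mask(H, W, block=8, mask_ratio=0.5):
    """Build one shared random mask over the target plane for each
    (block x block) group of source pixels. Because neighboring source
    pixels share the same mask, a masked cost entry cannot be recovered
    from a highly correlated neighboring cost map.
    Returns an (H, W, H, W) boolean tensor; True marks masked entries."""
    nh, nw = H // block, W // block
    # One random target-plane mask per source block.
    block_masks = torch.rand(nh, nw, H, W) < mask_ratio
    # Broadcast each block's mask to every source pixel inside the block.
    return block_masks.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)

# Toy usage: mask a 4D cost volume (source H x W by target H x W).
H = W = 16
cost_volume = torch.randn(H, W, H, W)
mask = block_sharing_mask(H, W, block=4)
masked_cost = cost_volume.masked_fill(mask, 0.0)  # zero-fill is a placeholder choice
```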
Related papers
- Improving the Training of Rectified Flows [14.652876697052156]
Diffusion models have shown great promise for image and video generation, but sampling from state-of-the-art models requires expensive numerical integration of a generative ODE.
One approach for tackling this problem is rectified flows, which iteratively learn smooth ODE paths that are less susceptible to truncation error.
We propose improved techniques for training rectified flows, allowing them to compete with knowledge distillation methods even in the low-NFE setting.
Our improved rectified flow outperforms state-of-the-art distillation methods such as consistency distillation and progressive distillation in both one-step and two-step settings.
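For context, the basic rectified-flow regression step can be sketched as follows; `v_net` and its `(x, t)` signature are assumptions, and the paper's improved training techniques are not shown.

```python
import torch

def rectified_flow_loss(v_net, x0, x1):
    """Generic rectified-flow objective (sketch): regress the velocity
    field toward the straight-line displacement between a noise sample
    x0 and a data sample x1."""
    t = torch.rand(x0.size(0), *([1] * (x0.dim() - 1)))  # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1   # point on the straight path at time t
    target = x1 - x0             # constant velocity along that path
    return ((v_net(xt, t) - target) ** 2).mean()
```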
arXiv Detail & Related papers (2024-05-30T17:56:04Z)
- Guided Flows for Generative Modeling and Decision Making [55.42634941614435]
We show that Guided Flows significantly improves the sample quality in conditional image generation and zero-shot text-to-speech synthesis.
Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, achieving a speedup in computation compared to diffusion models.
arXiv Detail & Related papers (2023-11-22T15:07:59Z)
- Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models [89.07925369856139]
We design a new type of tuning method, termed regularized mask tuning, which masks the network parameters through a learnable selection.
Inspired by neural pathways, we argue that the knowledge required by a downstream task already exists in the pre-trained weights but just gets concealed in the upstream pre-training stage.
It is noteworthy that we manage to deliver an 18.73% performance improvement over zero-shot CLIP by masking an average of only 2.56% of the parameters.
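A toy sketch of the parameter-masking mechanism is shown below, assuming a per-weight learnable score binarized with a straight-through estimator; the paper's regularization term is omitted and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskTunedLinear(nn.Module):
    """Wrap a pre-trained linear layer: the weights stay frozen while a
    learnable score per parameter selects which weights remain active
    (binarized in the forward pass via a straight-through estimator)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.register_buffer("weight", linear.weight.detach().clone())
        self.register_buffer(
            "bias", linear.bias.detach().clone() if linear.bias is not None else None
        )
        # Small positive init so (almost) all weights start active.
        self.score = nn.Parameter(torch.full_like(self.weight, 0.01))

    def forward(self, x):
        soft = torch.sigmoid(self.score)
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()  # hard forward value, soft gradients
        return F.linear(x, self.weight * mask, self.bias)

# Toy usage: only layer.score receives gradients.
layer = MaskTunedLinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
```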
arXiv Detail & Related papers (2023-07-27T17:56:05Z)
- FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow [49.40637769535569]
This paper introduces a novel transformer-based network architecture, FlowFormer, along with Masked Cost Volume Autoencoding (MCVA) for pretraining it to tackle the problem of optical flow estimation.
FlowFormer tokenizes the 4D cost-volume built from the source-target image pair and iteratively refines flow estimation with a cost-volume encoder-decoder architecture.
On the Sintel benchmark, the FlowFormer architecture achieves 1.16 and 2.09 average end-point error (AEPE) on the clean and final pass, a 16.5% and 15.5% error reduction from the best published results.
arXiv Detail & Related papers (2023-06-08T12:24:04Z)
- RetroMAE: Pre-training Retrieval-oriented Transformers via Masked Auto-Encoder [15.24707645921207]
We propose a novel pre-training framework for dense retrieval based on the Masked Auto-Encoder, known as RetroMAE.
We pre-train a BERT-like encoder on English Wikipedia and BookCorpus, where it notably outperforms existing pre-trained models on a wide range of dense retrieval benchmarks.
arXiv Detail & Related papers (2022-05-24T12:43:04Z)
- FlowFormer: A Transformer Architecture for Optical Flow [40.6027845855481]
Optical Flow TransFormer (FlowFormer) is a transformer-based neural network architecture for learning optical flow.
FlowFormer tokenizes the 4D cost volume built from an image pair and encodes the cost tokens into a cost memory with alternate-group transformer layers.
On the Sintel benchmark clean pass, FlowFormer achieves 1.178 average end-point error (AEPE), a 15.1% error reduction from the best published result (1.388).
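As a rough illustration of cost-map tokenization (a conv patch embedding per source pixel; sizes and layer choices here are assumptions, not FlowFormer's exact design):

```python
import torch
import torch.nn as nn

# Each source pixel owns an (H x W) cost map over target positions; a strided
# conv patch-embeds every cost map, yielding one token sequence per source pixel.
H = W = 16
patch, dim = 4, 64
cost_volume = torch.randn(H * W, 1, H, W)   # flattened 4D cost volume
patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
tokens = patch_embed(cost_volume).flatten(2).transpose(1, 2)
print(tokens.shape)  # (H*W source pixels, (H/4)*(W/4) = 16 tokens, 64 channels)
```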
arXiv Detail & Related papers (2022-03-30T10:33:09Z)
- GMFlow: Learning Optical Flow via Global Matching [124.57850500778277]
We propose a GMFlow framework for learning optical flow estimation.
It consists of three main components: a customized Transformer for feature enhancement, a correlation and softmax layer for global feature matching, and a self-attention layer for flow propagation.
Our new framework outperforms 32-iteration RAFT on the challenging Sintel benchmark.
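The global-matching step can be sketched as a softmax over an all-pairs correlation, reading flow out as the expected displacement; this simplification omits GMFlow's feature-enhancement Transformer and flow-propagation self-attention.

```python
import torch
import torch.nn.functional as F

def global_matching_flow(feat0, feat1):
    """Sketch: compare every source pixel of feat0 with all pixels of feat1,
    softmax the correlation into a matching distribution, and take the
    expected target coordinate minus the source coordinate as the flow."""
    B, C, H, W = feat0.shape
    f0 = feat0.flatten(2).transpose(1, 2)                    # (B, H*W, C)
    f1 = feat1.flatten(2)                                    # (B, C, H*W)
    prob = F.softmax(torch.bmm(f0, f1) / C ** 0.5, dim=-1)   # (B, H*W, H*W)

    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float().reshape(-1, 2)  # (H*W, 2)
    matched = prob @ grid                                    # expected target coords
    return (matched - grid).transpose(1, 2).reshape(B, 2, H, W)

# Toy usage with random features:
flow = global_matching_flow(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```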
arXiv Detail & Related papers (2021-11-26T18:59:56Z)
- LiteFlowNet3: Resolving Correspondence Ambiguity for More Accurate Optical Flow Estimation [99.19322851246972]
We introduce LiteFlowNet3, a deep network consisting of two specialized modules to address the problem of optical flow estimation.
LiteFlowNet3 not only achieves promising results on public benchmarks but also has a small model size and a fast runtime.
arXiv Detail & Related papers (2020-07-18T03:30:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.