ES-MVSNet: Efficient Framework for End-to-end Self-supervised Multi-View Stereo
- URL: http://arxiv.org/abs/2308.02191v1
- Date: Fri, 4 Aug 2023 08:16:47 GMT
- Title: ES-MVSNet: Efficient Framework for End-to-end Self-supervised Multi-View Stereo
- Authors: Qiang Zhou, Chaohui Yu, Jingliang Li, Yuang Liu, Jing Wang, Zhibin Wang
- Abstract summary: In this work, we propose an efficient framework for end-to-end self-supervised MVS, dubbed ES-MVSNet.
To alleviate the high memory consumption of current E2E self-supervised MVS frameworks, we present a memory-efficient architecture that reduces memory usage by 43% without compromising model performance.
With the novel design of an asymmetric view selection policy and region-aware depth consistency, we achieve state-of-the-art performance among E2E self-supervised MVS methods, without relying on third-party models for additional consistency signals.
- Score: 11.41432976633312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compared to multi-stage self-supervised multi-view stereo (MVS) methods, the end-to-end (E2E) approach has received more attention due to its concise and efficient training pipeline. Recent E2E self-supervised MVS approaches integrate third-party models (such as optical flow, semantic segmentation, and NeRF models) to provide additional consistency constraints, which increases GPU memory consumption and complicates the model's structure and training pipeline. In this work, we propose an efficient
framework for end-to-end self-supervised MVS, dubbed ES-MVSNet. To alleviate
the high memory consumption of current E2E self-supervised MVS frameworks, we
present a memory-efficient architecture that reduces memory usage by 43%
without compromising model performance. Furthermore, with the novel design of an asymmetric view selection policy and region-aware depth consistency, we achieve state-of-the-art performance among E2E self-supervised MVS methods, without relying on third-party models for additional consistency signals. Extensive
experiments on DTU and Tanks&Temples benchmarks demonstrate that the proposed
ES-MVSNet approach achieves state-of-the-art performance among E2E
self-supervised MVS methods and competitive performance to many supervised and
multi-stage self-supervised methods.
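The core training signal that E2E self-supervised MVS frameworks like this optimize is photometric consistency: the reference image should be reconstructible by warping source views through the predicted depth. Below is a minimal sketch of that signal in PyTorch; the shapes, names, and masking scheme are illustrative assumptions, not ES-MVSNet's actual implementation.

```python
# Hedged sketch of the photometric-consistency loss used by E2E
# self-supervised MVS methods. Pinhole camera model assumed; this is
# NOT ES-MVSNet's code, just the generic signal such methods build on.
import torch
import torch.nn.functional as F

def warp_src_to_ref(src_img, depth, K_ref, K_src, R, t):
    """Warp a source image into the reference view.
    src_img: (B, 3, H, W) source RGB
    depth:   (B, 1, H, W) depth predicted for the reference view
    K_ref, K_src: (B, 3, 3) intrinsics; R: (B, 3, 3), t: (B, 3, 1) ref->src pose
    """
    B, _, H, W = depth.shape
    dev = depth.device
    # Homogeneous pixel grid of the reference view: (B, 3, H*W).
    y, x = torch.meshgrid(
        torch.arange(H, device=dev, dtype=torch.float32),
        torch.arange(W, device=dev, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], 0).view(1, 3, -1).expand(B, -1, -1)
    # Back-project to 3D in the reference frame, then move into the source frame.
    cam = (torch.inverse(K_ref) @ pix) * depth.view(B, 1, -1)
    pix_src = K_src @ (R @ cam + t)
    pix_src = pix_src[:, :2] / pix_src[:, 2:3].clamp(min=1e-6)
    # Normalize coordinates to [-1, 1] for grid_sample.
    gx = 2.0 * pix_src[:, 0] / (W - 1) - 1.0
    gy = 2.0 * pix_src[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True)
    # Keep only pixels that project inside the source image.
    valid = ((gx.abs() <= 1) & (gy.abs() <= 1)).view(B, 1, H, W).float()
    return warped, valid

def photometric_loss(ref_img, warped_src, valid):
    # Masked mean L1 difference between reference and warped source.
    return ((ref_img - warped_src).abs() * valid).sum() / valid.sum().clamp(min=1.0)
```

In an ES-MVSNet-style model, the asymmetric view selection policy would decide which source views contribute to this loss and the region-aware depth consistency would reweight it per region; both are omitted from this sketch.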
Related papers
- A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
Stable Diffusion Model is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation.
This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z)
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping, and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can improve training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
- Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [27.930351465266515]
We propose MoE-Tuning, a simple yet effective training strategy for LVLMs.
MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers.
Experiments show the strong performance of MoE-LLaVA across a variety of visual understanding and object hallucination benchmarks.
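As a rough illustration of the top-k routing described above, here is a generic sparse MoE layer; the expert width, renormalization over the selected logits, and class name are assumptions, not MoE-LLaVA's code.

```python
# Hedged sketch of a sparse mixture-of-experts layer with top-k routing.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)  # routing logits per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)])

    def forward(self, x):                      # x: (tokens, dim)
        logits = self.router(x)                # (tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)      # renormalize over the selected k
        out = torch.zeros_like(x)
        for slot in range(self.k):             # only the chosen experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out
```

For example, `TopKMoE(dim=512)(torch.randn(16, 512))` routes each of the 16 tokens through only 2 of the 4 experts.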
arXiv Detail & Related papers (2024-01-29T08:13:40Z)
- LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering [6.263815658578159]
LCV2, a modular method, is proposed for the Grounded Visual Question Answering task in the vision-language multimodal domain.
The approach relies on a frozen large language model (LLM) as an intermediate mediator between an off-the-shelf VQA model and an off-the-shelf visual grounding (VG) model.
This framework can be deployed for VQA Grounding tasks under low computational resources.
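A hedged sketch of such a mediator pipeline follows; all three callables are hypothetical placeholders and the prompt wording is an assumption, not LCV2's actual design.

```python
# Hypothetical glue code: a frozen LLM mediates between an off-the-shelf
# VQA model and an off-the-shelf visual grounding (VG) model.
def grounded_vqa(image, question, vqa_model, llm, vg_model):
    answer = vqa_model(image, question)          # frozen VQA model
    # The frozen LLM rewrites the Q/A pair into a referring expression
    # that the VG model can localize (prompt wording is assumed).
    prompt = (f"Question: {question}\nAnswer: {answer}\n"
              "Rewrite the answer as a short noun phrase describing "
              "the object to locate in the image.")
    referring_expr = llm(prompt)                 # frozen LLM as mediator
    bbox = vg_model(image, referring_expr)       # frozen VG model
    return answer, bbox
```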
arXiv Detail & Related papers (2024-01-29T02:32:25Z)
- MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo [60.75684891484619]
We introduce MVSFormer++, a method that maximizes the inherent characteristics of attention to enhance various components of the MVS pipeline.
We employ different attention mechanisms for the feature encoder and cost volume regularization, focusing on feature and spatial aggregations respectively.
Comprehensive experiments on DTU, Tanks-and-Temples, BlendedMVS, and ETH3D validate the effectiveness of the proposed method.
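As a rough sketch of attention-based cost volume regularization in this spirit (the token layout, residual scheme, and head count are assumptions, not MVSFormer++'s design):

```python
# Hedged sketch: self-attention across the depth hypotheses of an MVS
# cost volume, applied independently at each pixel location.
import torch.nn as nn

class CostVolumeAttention(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, cost):                   # cost: (B, C, D, H, W)
        B, C, D, H, W = cost.shape
        # Treat the D depth hypotheses at each pixel as a token sequence.
        tokens = cost.permute(0, 3, 4, 2, 1).reshape(B * H * W, D, C)
        out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + out)       # residual + norm
        return tokens.reshape(B, H, W, D, C).permute(0, 4, 3, 1, 2)
```

At full resolution this would be memory-hungry, so attention of this kind would plausibly be applied only at coarse cost volume stages.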
arXiv Detail & Related papers (2024-01-22T03:22:49Z)
- EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity [13.02735046166494]
Self-supervised monocular scene flow estimation has received increasing attention for its simple and economical sensor setup.
We propose a superior model, EMR-MSF, which borrows the advantages of network architecture design from supervised learning.
On the KITTI scene flow benchmark, our approach improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44%.
arXiv Detail & Related papers (2023-09-04T00:30:06Z)
- Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module, the SAM-guidEd refinEment Module (SEEM).
This lightweight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)
- Digging into Uncertainty in Self-supervised Multi-view Stereo [57.04768354383339]
We propose a novel Uncertainty reduction Multi-view Stereo (UMVS) framework for self-supervised learning.
Our framework achieves the best performance among unsupervised MVS methods, and performance competitive with its supervised counterparts.
arXiv Detail & Related papers (2021-08-30T02:53:08Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module that exploits video complexity using separable convolutions for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
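A minimal sketch of separable multi-view fusion in this spirit follows; the three-branch design and residual summation are assumptions, not MVFNet's exact module.

```python
# Hedged sketch: fuse a video tensor along its temporal, height, and width
# "views" with cheap depthwise (separable) 1D convolutions.
import torch.nn as nn

class SeparableMVF(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Depthwise 1D convs along T, H, and W respectively (groups=channels).
        self.conv_t = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0), groups=channels)
        self.conv_h = nn.Conv3d(channels, channels, (1, 3, 1), padding=(0, 1, 0), groups=channels)
        self.conv_w = nn.Conv3d(channels, channels, (1, 1, 3), padding=(0, 0, 1), groups=channels)

    def forward(self, x):                      # x: (B, C, T, H, W)
        # Sum the three axis-wise views and keep a residual path.
        return x + self.conv_t(x) + self.conv_h(x) + self.conv_w(x)
```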
arXiv Detail & Related papers (2020-12-13T06:34:18Z)