MegBA: A High-Performance and Distributed Library for Large-Scale Bundle Adjustment
- URL: http://arxiv.org/abs/2112.01349v1
- Date: Thu, 2 Dec 2021 15:50:18 GMT
- Title: MegBA: A High-Performance and Distributed Library for Large-Scale Bundle Adjustment
- Authors: Jie Ren, Wenteng Liang, Ran Yan, Luo Mai, Shiwen Liu, Xiao Liu
- Abstract summary: MegBA is a high-performance and distributed library for large-scale Bundle Adjustment.
It can outperform the state-of-the-art BA libraries Ceres and DeepLM by up to 33x and 3.3x, respectively, on public large-scale BA benchmarks.
- Score: 4.719974460724886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale Bundle Adjustment (BA) is the key for many 3D vision applications
(e.g., Structure-from-Motion and SLAM). Though important, large-scale BA is
still poorly supported by existing BA libraries (e.g., Ceres and g2o). These
libraries under-utilise accelerators (i.e., GPUs), and they lack algorithms for
distributing BA computation that would otherwise be constrained by the memory of a single device.
In this paper, we propose MegBA, a high-performance and distributed library
for large-scale BA. MegBA has a novel end-to-end vectorised BA algorithm that
can fully exploit the massive parallel cores on GPUs, thus speeding up the
entire BA computation. It also has a novel distributed BA algorithm that can
automatically partition BA problems, and solve BA sub-problems using
distributed GPUs. The GPUs synchronise intermediate solving state using
network-efficient collective communication, and the synchronisation is designed
to minimise communication cost. MegBA has a memory-efficient GPU runtime and
exposes g2o-compatible APIs. Experiments show that MegBA can outperform
state-of-the-art BA libraries (i.e., Ceres and DeepLM) by up to 33x and 3.3x,
respectively, on public large-scale BA benchmarks. The code of MegBA is
available at: \url{https://github.com/MegviiRobot/MegBA}.
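The distributed design described in the abstract (partition the BA problem, solve sub-problems on separate devices, and merge intermediate solving state with collective communication) can be illustrated with a small, generic sketch. The code below is not MegBA code and does not use MegBA's API: it runs on CPUs with MPI rather than on GPUs with a GPU-side collective library, and every name, problem size, and gradient value is a placeholder chosen only to show the partition-then-all-reduce pattern.

```cpp
// Minimal sketch of distributed BA-style synchronisation (NOT MegBA's implementation).
// Each worker owns a partition of the observations, accumulates a partial update to
// the shared camera state, and one collective all-reduce merges the partials.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, world = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world);

  // Toy shared state: 4 cameras with 6 parameters each.
  const int num_cameras = 4, cam_dim = 6;
  std::vector<double> grad(num_cameras * cam_dim, 0.0);

  // Each worker processes only its own round-robin partition of the observations.
  // The "residual work" is faked here; a real solver would linearise reprojection
  // errors (on the GPU, in MegBA's case) instead of adding a constant.
  const int total_obs = 1000;
  for (int obs = rank; obs < total_obs; obs += world) {
    int cam = obs % num_cameras;
    for (int k = 0; k < cam_dim; ++k)
      grad[cam * cam_dim + k] += 1e-3;  // placeholder contribution
  }

  // One collective synchronisation step per iteration: sum the partial gradients
  // so every worker sees the globally consistent value before the next solver step.
  MPI_Allreduce(MPI_IN_PLACE, grad.data(), static_cast<int>(grad.size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("camera 0, first gradient entry after all-reduce: %f\n", grad[0]);

  MPI_Finalize();
  return 0;
}
```

In a real distributed solver the loop body would linearise reprojection errors and the synchronised state would typically be richer than a single gradient vector; the sketch only conveys the communication structure, one collective operation per iteration over the state shared by all workers, which is the part the abstract says is designed to minimise communication cost.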
Related papers
- PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters [36.52497630960292]
prima.cpp is a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support.
prima.cpp outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%.
This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals.
arXiv Detail & Related papers (2025-04-07T13:46:21Z)
- CAT: A GPU-Accelerated FHE Framework with Its Application to High-Precision Private Dataset Query [0.51795041186793]
We introduce an open-source GPU-accelerated fully homomorphic encryption (FHE) framework CAT.
CAT features a three-layer architecture: a foundation of core math, a bridge of pre-computed elements and combined operations, and an API-accessible layer of FHE operators.
Based on our framework, we implement three widely used FHE schemes: CKKS, BFV, and BGV.
arXiv Detail & Related papers (2025-03-28T08:20:18Z)
- BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems [56.16884466478886]
BurTorch is a compact high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations.
BurTorch adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research.
arXiv Detail & Related papers (2025-03-18T00:52:12Z)
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments [53.71158537264695]
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices.
We introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance.
arXiv Detail & Related papers (2024-10-31T13:26:11Z)
- Bundle Adjustment in the Eager Mode [14.13835018035969]
We introduce an eager-mode Bundle Adjustment framework seamlessly integrated with PyPose.
Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers.
Our approach demonstrates substantial efficiency, achieving average speedups of 18.5x, 22x, and 23x compared to GTSAM, g2o, and Ceres, respectively.
arXiv Detail & Related papers (2024-09-18T17:59:29Z)
- XLB: A differentiable massively parallel lattice Boltzmann library in Python [0.0]
We introduce XLB, a Python-based differentiable lattice Boltzmann method (LBM) library built on the JAX platform.
XLB's differentiability and data structures are compatible with the extensive JAX-based machine learning ecosystem.
XLB has been successfully scaled to handle simulations with billions of cells, achieving giga-scale lattice updates per second.
arXiv Detail & Related papers (2023-11-27T18:50:37Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- CPU- and GPU-based Distributed Sampling in Dirichlet Process Mixtures for Large-scale Analysis [11.071895608242675]
The Dirichlet Process Mixture Model (DPMM) is a principled approach for adapting the complexity of the model to the data.
Despite their potential and mathematical elegance, DPMMs have yet to become a mainstream tool widely adopted by practitioners.
We propose a new, easy-to-use statistical software package for scalable DPMM inference.
arXiv Detail & Related papers (2022-04-19T16:35:44Z)
- ReservoirComputing.jl: An Efficient and Modular Library for Reservoir Computing Models [0.17499351967216337]
ReservoirComputing.jl is an open source Julia library for reservoir computing models.
The code and documentation are hosted on GitHub under an MIT license.
arXiv Detail & Related papers (2022-04-08T13:33:09Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data Analytics [0.0]
HeAT is an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API.
HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI.
When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.
arXiv Detail & Related papers (2020-07-27T13:33:17Z)
- Hybrid Models for Learning to Branch [81.93868699246214]
We propose a new hybrid architecture for efficient branching on CPU machines.
The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLPs) for branching.
arXiv Detail & Related papers (2020-06-26T21:03:45Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.