MegBA: A High-Performance and Distributed Library for Large-Scale Bundle Adjustment
- URL: http://arxiv.org/abs/2112.01349v1
- Date: Thu, 2 Dec 2021 15:50:18 GMT
- Title: MegBA: A High-Performance and Distributed Library for Large-Scale Bundle Adjustment
- Authors: Jie Ren, Wenteng Liang, Ran Yan, Luo Mai, Shiwen Liu, Xiao Liu
- Abstract summary: MegBA is a high-performance and distributed library for large-scale Bundle Adjustment.
It can outperform the state-of-the-art BA libraries Ceres and DeepLM by up to 33x and 3.3x, respectively, on public large-scale BA benchmarks.
- Score: 4.719974460724886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale Bundle Adjustment (BA) is the key for many 3D vision applications
(e.g., Structure-from-Motion and SLAM). Though important, large-scale BA is
still poorly supported by existing BA libraries (e.g., Ceres and g2o). These
libraries under-utilise accelerators (i.e., GPUs), and they lack algorithms for
distributing BA computation that would otherwise be constrained by the memory of a single device.
In this paper, we propose MegBA, a high-performance and distributed library
for large-scale BA. MegBA has a novel end-to-end vectorised BA algorithm that
can fully exploit the massive parallel cores on GPUs, thus speeding up the
entire BA computation. It also has a novel distributed BA algorithm that can
automatically partition BA problems, and solve BA sub-problems using
distributed GPUs. The GPUs synchronise intermediate solving state using
network-efficient collective communication, and the synchronisation is designed
to minimise communication cost. MegBA has a memory-efficient GPU runtime and
exposes g2o-compatible APIs. Experiments show that MegBA can outperform
state-of-the-art BA libraries (i.e., Ceres and DeepLM) by up to 33x and 3.3x,
respectively, on public large-scale BA benchmarks. The code of MegBA is
available at: \url{https://github.com/MegviiRobot/MegBA}.
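The distributed design described in the abstract (partition the BA problem, solve sub-problems on separate devices, and merge intermediate solving state with collective communication) can be illustrated with a small, generic sketch. The code below is not MegBA code and does not use MegBA's API: it runs on CPUs with MPI rather than on GPUs with a GPU-side collective library, and every name, problem size, and gradient value is a placeholder chosen only to show the partition-then-all-reduce pattern.

```cpp
// Minimal sketch of distributed BA-style synchronisation (NOT MegBA's implementation).
// Each worker owns a partition of the observations, accumulates a partial update to
// the shared camera state, and one collective all-reduce merges the partials.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, world = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world);

  // Toy shared state: 4 cameras with 6 parameters each.
  const int num_cameras = 4, cam_dim = 6;
  std::vector<double> grad(num_cameras * cam_dim, 0.0);

  // Each worker processes only its own round-robin partition of the observations.
  // The "residual work" is faked here; a real solver would linearise reprojection
  // errors (on the GPU, in MegBA's case) instead of adding a constant.
  const int total_obs = 1000;
  for (int obs = rank; obs < total_obs; obs += world) {
    int cam = obs % num_cameras;
    for (int k = 0; k < cam_dim; ++k)
      grad[cam * cam_dim + k] += 1e-3;  // placeholder contribution
  }

  // One collective synchronisation step per iteration: sum the partial gradients
  // so every worker sees the globally consistent value before the next solver step.
  MPI_Allreduce(MPI_IN_PLACE, grad.data(), static_cast<int>(grad.size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("camera 0, first gradient entry after all-reduce: %f\n", grad[0]);

  MPI_Finalize();
  return 0;
}
```

In a real distributed solver the loop body would linearise reprojection errors and the synchronised state would typically be richer than a single gradient vector; the sketch only conveys the communication structure, one collective operation per iteration over the state shared by all workers, which is the part the abstract says is designed to minimise communication cost.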
Related papers
- PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters [36.52497630960292]
prima.cpp is a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support.
prima.cpp outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%.
This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals.
arXiv Detail & Related papers (2025-04-07T13:46:21Z)
- CAT: A GPU-Accelerated FHE Framework with Its Application to High-Precision Private Dataset Query [0.51795041186793]
We introduce an open-source GPU-accelerated fully homomorphic encryption (FHE) framework CAT.
CAT features a three-layer architecture: a foundation of core math, a bridge of pre-computed elements and combined operations, and an API-accessible layer of FHE operators.
Based on our framework, we implement three widely used FHE schemes: CKKS, BFV, and BGV.
arXiv Detail & Related papers (2025-03-28T08:20:18Z)
- BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems [56.16884466478886]
BurTorch is a compact high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations.
BurTorch adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research.
arXiv Detail & Related papers (2025-03-18T00:52:12Z)
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments [53.71158537264695]
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices.
We introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance.
arXiv Detail & Related papers (2024-10-31T13:26:11Z)
- Bundle Adjustment in the Eager Mode [14.13835018035969]
We introduce an eager-mode Bundle Adjustment framework seamlessly integrated with PyPose.
Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers.
Our approach demonstrates substantial efficiency, achieving average speedups of 18.5x, 22x, and 23x compared to GTSAM, g2o, and Ceres, respectively.
arXiv Detail & Related papers (2024-09-18T17:59:29Z)
- XLB: A differentiable massively parallel lattice Boltzmann library in Python [0.0]
We introduce XLB, a Python-based differentiable lattice Boltzmann method (LBM) library built on the JAX platform.
XLB's differentiability and data structures are compatible with the extensive JAX-based machine learning ecosystem.
XLB has been successfully scaled to handle simulations with billions of cells, achieving giga-scale lattice updates per second.
arXiv Detail & Related papers (2023-11-27T18:50:37Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- CPU- and GPU-based Distributed Sampling in Dirichlet Process Mixtures for Large-scale Analysis [11.071895608242675]
The Dirichlet Process Mixture Model (DPMM) is a principled approach for adapting the complexity of the model to the data.
Despite their potential and mathematical elegance, DPMMs have yet to become a mainstream tool widely adopted by practitioners.
We propose a new, easy-to-use statistical software package for scalable DPMM inference.
arXiv Detail & Related papers (2022-04-19T16:35:44Z)
- ReservoirComputing.jl: An Efficient and Modular Library for Reservoir Computing Models [0.17499351967216337]
ReservoirComputing.jl is an open source Julia library for reservoir computing models.
The code and documentation are hosted on GitHub under an MIT license.
arXiv Detail & Related papers (2022-04-08T13:33:09Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data Analytics [0.0]
HeAT is an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API.
HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI.
When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.
arXiv Detail & Related papers (2020-07-27T13:33:17Z)
- Hybrid Models for Learning to Branch [81.93868699246214]
We propose a new hybrid architecture for efficient branching on CPU machines.
The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLPs) for branching.
arXiv Detail & Related papers (2020-06-26T21:03:45Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.