Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
- URL: http://arxiv.org/abs/2410.12781v2
- Date: Fri, 01 Aug 2025 04:29:18 GMT
- Title: Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
- Authors: Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, Zexiang Xu,
- Abstract summary: Long-LRM is a feed-forward 3D Gaussian reconstruction model for instant, high-resolution, 360° wide-coverage, scene-level reconstruction. It takes in 32 input images at a resolution of 960x540 and produces the reconstruction in just 1 second on a single A100 GPU. We evaluate Long-LRM on the large-scale DL3DV benchmark and Tanks&Temples, demonstrating reconstruction quality comparable to optimization-based methods.
- Score: 31.37432523412404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Long-LRM, a feed-forward 3D Gaussian reconstruction model for instant, high-resolution, 360° wide-coverage, scene-level reconstruction. Specifically, it takes in 32 input images at a resolution of 960x540 and produces the Gaussian reconstruction in just 1 second on a single A100 GPU. To handle the long sequence of 250K tokens brought by the large input size, Long-LRM features a mixture of the recent Mamba2 blocks and the classical transformer blocks, enhanced by a light-weight token merging module and Gaussian pruning steps that balance between quality and efficiency. We evaluate Long-LRM on the large-scale DL3DV benchmark and Tanks&Temples, demonstrating reconstruction quality comparable to the optimization-based methods while achieving an 800x speedup w.r.t. the optimization-based approaches and an input size at least 60x larger than the previous feed-forward approaches. We conduct extensive ablation studies on our model design choices for both rendering quality and computation efficiency. We also explore Long-LRM's compatibility with other Gaussian variants such as 2D GS, which enhances Long-LRM's ability in geometry reconstruction. Project page: https://arthurhero.github.io/projects/llrm
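The token-budget arithmetic described in the abstract (32 views at 960x540 patchified into roughly 250K tokens, then shortened by a lightweight token merging module before the Mamba2/transformer stack) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the 8x8 patch size, the 2x2 average-pool merge, and the small demo shapes are all assumptions.

```python
# Hypothetical sketch of patchify + token merging, assuming an 8x8 patch
# size and a 2x2 average-pool merge; the paper's exact configuration may
# differ. At full scale (32 views of 960x544), patchify would yield
# 32 * 68 * 120 = 261,120 tokens, matching the ~250K figure above.
import numpy as np

def patchify(images, patch=8):
    """Split (V, H, W, C) images into flattened per-patch tokens."""
    v, h, w, c = images.shape
    gh, gw = h // patch, w // patch
    x = images[:, :gh * patch, :gw * patch, :]
    x = x.reshape(v, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)          # (V, gh, gw, patch, patch, C)
    return x.reshape(v * gh * gw, patch * patch * c), (v, gh, gw)

def merge_tokens(tokens, grid, factor=2):
    """Average-pool factor x factor spatial neighborhoods of tokens per view."""
    v, gh, gw = grid
    d = tokens.shape[-1]
    t = tokens.reshape(v, gh // factor, factor, gw // factor, factor, d)
    return t.mean(axis=(2, 4)).reshape(-1, d)

# Small demo shapes for illustration only.
views = np.random.rand(4, 64, 96, 3)
tokens, grid = patchify(views)                  # (384, 192) tokens
merged = merge_tokens(tokens, grid)             # (96, 192): 4x fewer tokens
print(tokens.shape, merged.shape)
```

Merging before the sequence backbone is what keeps the quadratic-cost transformer blocks affordable; the Mamba2 blocks handle the remaining long-range mixing at linear cost.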
Related papers
- KaoLRM: Repurposing Pre-trained Large Reconstruction Models for Parametric 3D Face Reconstruction [51.67605823241639]
KaoLRM re-targets the learned prior of the Large Reconstruction Model (LRM) for parametric 3D face reconstruction from single-view images. Experiments on both controlled and in-the-wild benchmarks demonstrate that KaoLRM achieves superior reconstruction accuracy and cross-view consistency.
arXiv Detail & Related papers (2026-01-19T05:36:59Z) - Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction [19.234118544637592]
Long-LRM++ adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU. Our design also scales to 64 input views at the 950×540 resolution, demonstrating strong generalization to increased input lengths.
arXiv Detail & Related papers (2025-12-11T04:10:21Z) - E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources [12.244453688491731]
Efficient Multimodal Diffusion Transformer (E-MMDiT) is an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis. Our model for 512px generation, trained with only 25M public data samples in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches 0.72 with post-training techniques such as GRPO.
arXiv Detail & Related papers (2025-10-31T03:13:08Z) - LongSplat: Online Generalizable 3D Gaussian Splatting from Long Sequence Images [44.558724617615006]
LongSplat is an online real-time 3D Gaussian reconstruction framework for long-sequence image input. GIR encodes 3D Gaussian parameters into a structured, image-like 2D format. LongSplat achieves state-of-the-art efficiency-quality trade-offs in real-time novel view synthesis.
arXiv Detail & Related papers (2025-07-22T01:43:51Z) - RelitLRM: Generative Relightable Radiance for Large Reconstruction Models [52.672706620003765]
We propose RelitLRM for generating high-quality Gaussian splatting representations of 3D objects under novel illuminations.
Unlike prior inverse rendering methods requiring dense captures and slow optimization, RelitLRM adopts a feed-forward transformer-based model.
We show our sparse-view feed-forward RelitLRM offers competitive relighting results to state-of-the-art dense-view optimization-based baselines.
arXiv Detail & Related papers (2024-10-08T17:40:01Z) - M-LRM: Multi-view Large Reconstruction Model [37.46572626325514]
Multi-view Large Reconstruction Model (M-LRM) is designed to efficiently reconstruct high-quality 3D shapes from multi-view inputs in a 3D-aware manner.
Compared to the Large Reconstruction Model, the proposed M-LRM can produce a tri-plane NeRF at 128×128 resolution and generate 3D shapes of high fidelity.
arXiv Detail & Related papers (2024-06-11T18:29:13Z) - MVGamba: Unify 3D Content Generation as State Space Sequence Modeling [150.80564081817786]
We introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor.
With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts.
Experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with only about 0.1× the model size.
arXiv Detail & Related papers (2024-06-10T15:26:48Z) - GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting [49.32327147931905]
We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussians from 2-4 posed sparse images in 0.23 seconds on a single A100 GPU.
Our model features a very simple transformer-based architecture; we patchify input posed images, pass the primitive multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering.
arXiv Detail & Related papers (2024-04-30T16:47:46Z) - MeshLRM: Large Reconstruction Model for High-Quality Meshes [52.71164862539288]
MeshLRM can reconstruct a high-quality mesh from merely four input images in less than one second. Our approach achieves state-of-the-art mesh reconstruction from sparse-view inputs and also allows for many downstream applications.
arXiv Detail & Related papers (2024-04-18T17:59:41Z) - Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction [153.52406455209538]
Gamba is an end-to-end 3D reconstruction model from a single-view image.
It completes reconstruction within 0.05 seconds on a single NVIDIA A100 GPU.
arXiv Detail & Related papers (2024-03-27T17:40:14Z) - GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation [85.15374487533643]
We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s.
GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information.
We also showcase the potential of GRM in generative tasks, i.e., text-to-3D and image-to-3D, by integrating it with existing multi-view diffusion models.
arXiv Detail & Related papers (2024-03-21T17:59:34Z) - U-shaped Vision Mamba for Single Image Dehazing [8.134659382415185]
We introduce U-shaped Vision Mamba (UVM-Net), an efficient single-image dehazing network.
Inspired by the State Space Sequence Models (SSMs), a new deep sequence model known for its power to handle long sequences, we design a Bi-SSM block.
Our method takes only 0.009 seconds to infer a 325×325 resolution image (100 FPS) without I/O handling time.
arXiv Detail & Related papers (2024-02-06T16:46:28Z) - LRM: Large Reconstruction Model for Single Image to 3D [61.47357798633123]
We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds.
LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image.
We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects.
arXiv Detail & Related papers (2023-11-08T00:03:52Z) - Bayesian Image Reconstruction using Deep Generative Models [7.012708932320081]
In this work, we leverage state-of-the-art (SOTA) generative models for building powerful image priors.
Our method, called Bayesian Reconstruction through Generative Models (BRGM), uses a single pre-trained generator model to solve different image restoration tasks.
arXiv Detail & Related papers (2020-12-08T17:11:26Z) - Locally Masked Convolution for Autoregressive Models [107.4635841204146]
LMConv is a simple modification to the standard 2D convolution that allows arbitrary masks to be applied to the weights at each location in the image.
We learn an ensemble of distribution estimators that share parameters but differ in generation order, achieving improved performance on whole-image density estimation.
arXiv Detail & Related papers (2020-06-22T17:59:07Z)
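Several of the feed-forward models listed above (GS-LRM explicitly, and Long-LRM in its final stage) decode per-pixel Gaussian parameters directly from image tokens. The following sketch shows only the shape bookkeeping of such a decode head; the 14-parameter layout (3 position + 3 scale + 4 rotation quaternion + 1 opacity + 3 color) and every size here are illustrative assumptions, not any paper's exact design.

```python
# Hypothetical per-pixel Gaussian decode head: one linear projection maps
# each backbone token to patch*patch Gaussians, i.e. one Gaussian per input
# pixel. The 14-parameter layout is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(0)
patch, dim = 8, 192                  # assumed patch size and token width
n_tokens = 96                        # tokens coming out of the backbone
params_per_gaussian = 14             # 3 pos + 3 scale + 4 quat + 1 opacity + 3 color

tokens = rng.standard_normal((n_tokens, dim))
head = rng.standard_normal((dim, patch * patch * params_per_gaussian)) * 0.01
gaussians = (tokens @ head).reshape(n_tokens * patch * patch,
                                    params_per_gaussian)
print(gaussians.shape)               # one Gaussian per pixel of every patch
```

Because the decode is a single linear layer over tokens, the Gaussian count scales exactly with input pixels, which is why the subsequent Gaussian pruning step in Long-LRM matters at scene scale.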
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.