Related papers: Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

URL: http://arxiv.org/abs/2602.23153v1
Date: Thu, 26 Feb 2026 16:16:02 GMT
Title: Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
Authors: Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang, Fabio Poiesi
Abstract summary: Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. We propose Fase3D, the first efficient encoder-free 3D scene LMM.
Score: 22.758597018527244
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: https://tev-fbk.github.io/Fase3D.
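The serialize-then-FFT idea in the abstract can be illustrated with a minimal sketch. It is not the Fase3D tokenizer: the abstract does not say which space-filling curve is used or how the FFT is applied, so the sketch assumes a Z-order (Morton) curve and an FNet-style mixing step that keeps the real part of a 2D FFT. The names `morton_code`, `serialize_points`, and `fft_token_mixing` are hypothetical.

```python
import numpy as np


def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative grid coordinates (Z-order)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def serialize_points(points, bits=10):
    """Return the index order that sorts points along a Z-order curve.

    Assumption: a Z-order curve stands in for whichever space-filling
    curve the paper actually uses.
    """
    mins = points.min(axis=0)
    span = np.maximum(points.max(axis=0) - mins, 1e-9)
    # Quantize each coordinate to an integer grid of 2**bits cells per axis.
    grid = ((points - mins) / span * (2**bits - 1)).astype(np.int64)
    codes = [morton_code(int(x), int(y), int(z), bits) for x, y, z in grid]
    return np.argsort(codes, kind="stable")


def fft_token_mixing(tokens):
    """FNet-style mixing (an assumption, not the paper's layer):
    2D FFT over the sequence and feature axes, keeping the real part."""
    return np.fft.fft2(tokens).real


rng = np.random.default_rng(0)
points = rng.uniform(size=(64, 3))     # toy point cloud
features = rng.normal(size=(64, 8))    # one feature vector per point

order = serialize_points(points)            # unordered set -> 1D sequence
mixed = fft_token_mixing(features[order])   # global mixing in O(N log N)
```

Sorting along the curve gives the point set a canonical order, so shuffling the input points leaves the serialized sequence unchanged as long as no two points share a grid cell. The FFT then mixes information across the whole sequence without computing pairwise attention scores.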

Related papers

PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding [67.15800065888887]
Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. We introduce an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering.
arXiv Detail & Related papers (2026-01-05T18:55:45Z)
Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z)
How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need? [56.09721366421187]
We present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95%. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures.
arXiv Detail & Related papers (2025-11-07T17:38:01Z)
TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP [52.79100775328595]
3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions. Existing 3D visual grounding methods rely on separate encoders for different modalities. We propose a unified 2D pre-trained multi-modal network to process all three modalities.
arXiv Detail & Related papers (2025-07-20T10:28:06Z)
AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning [27.40106634796608]
Large Multimodal Models (LMMs) have become a pivotal research focus in deep learning. Currently, 3D LMMs employing thousands of spatial tokens for multimodal reasoning suffer from critical inefficiencies. We propose AdaToken-3D, an adaptive spatial token optimization framework that dynamically prunes redundant tokens.
arXiv Detail & Related papers (2025-05-19T07:11:07Z)
Exploring the Potential of Encoder-free Architectures in 3D LMMs [40.43146298677712]
We present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D Large Multimodal Models. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding.
arXiv Detail & Related papers (2025-02-13T18:59:45Z)
3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer [33.42183318484381]
We introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities.
arXiv Detail & Related papers (2025-01-02T09:33:13Z)
YOLOO: You Only Learn from Others Once [27.222676133154284]
We propose YOLOO, a novel multi-modal 3D MOT paradigm: You Only Learn from Others Once. YOLOO empowers the point cloud encoder to learn a unified tri-modal representation (UTR) from point clouds and other modalities, such as images and textual cues, all at once. Specifically, YOLOO includes two core components: a unified tri-modal encoder (UTEnc) and a flexible geometric constraint (F-GC) module.
arXiv Detail & Related papers (2024-09-01T05:09:32Z)
EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration. An online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation. We propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules. Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.