Related papers: 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

URL: http://arxiv.org/abs/2603.04976v1
Date: Thu, 05 Mar 2026 09:15:16 GMT
Title: 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
Authors: Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang
Abstract summary: We present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT). 3D-RFT is the first framework to extend RLVR to video-based 3D perception and reasoning. We show that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks.
Score: 21.70953326671503
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performance. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal desirable properties of 3D-RFT, such as robust efficacy, and offer valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
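
To make the reward idea in the abstract concrete, here is a minimal sketch of a metric-derived reward and a GRPO-style group normalization. It is an illustration under stated assumptions, not the authors' implementation: the function names `aabb_iou_3d` and `group_relative_advantages` are hypothetical, the boxes are assumed to be axis-aligned (the paper's 3D IoU may use oriented boxes), and the advantage is the common "standardize rewards within a sampled group" form.

```python
import math


def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax).

    Assumption: axis-aligned boxes only; oriented boxes need a different routine.
    """
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        inter *= max(0.0, hi - lo)  # zero overlap on any axis gives zero volume
    vol_a = math.prod(box_a[i + 3] - box_a[i] for i in range(3))
    vol_b = math.prod(box_b[i + 3] - box_b[i] for i in range(3))
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical usage: score several predicted boxes against one ground-truth box,
# then turn the verifiable rewards into relative advantages for the group.
ground_truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
predictions = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # exact match
    (1.0, 1.0, 1.0, 3.0, 3.0, 3.0),  # partial overlap
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # no overlap
]
rewards = [aabb_iou_3d(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because the reward is the evaluation metric itself, a response with a better-fitting box receives a higher advantage than its siblings in the same group, which is the alignment between training signal and task metric that the abstract describes.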

Related papers

Reasoning Matters for 3D Visual Grounding [39.725360883988515]
We propose a 3D visual grounding data pipeline, which is capable of automatically synthesizing 3D visual grounding data along with the corresponding reasoning process. We also introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data.
arXiv Detail & Related papers (2026-01-13T18:48:41Z)
D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation [66.7166217399105]
Embodied agents face a critical dilemma that end-to-end models lack interpretability and explicit 3D reasoning. Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially-annotated hybrid data.
arXiv Detail & Related papers (2025-12-14T09:53:15Z)
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy [4.1703677379815565]
We propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data. In our method, geometric priors are used directly to improve scene perception. Experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Captioning and 3D Visual Grounding tasks.
arXiv Detail & Related papers (2025-09-29T07:34:18Z)
3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding [11.069512983766783]
Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks. We propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks.
arXiv Detail & Related papers (2025-07-31T11:59:06Z)
TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP [52.79100775328595]
3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions. Existing 3D visual grounding methods rely on separate encoders for different modalities. We propose a unified 2D pre-trained multi-modal network to process all three modalities.
arXiv Detail & Related papers (2025-07-20T10:28:06Z)
MLLMs Need 3D-Aware Representation Supervision for Scene Understanding [14.083262551714133]
3DRS is a framework that enhances MLLM 3D representation learning by introducing supervision from pretrained 3D foundation models. Our approach aligns MLLM visual features with rich 3D knowledge distilled from 3D models, effectively improving scene understanding.
arXiv Detail & Related papers (2025-06-02T17:58:24Z)
LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation [107.07047303858664]
Large-scale human datasets with 3D ground-truth annotations are difficult to obtain in the wild. We address this problem by augmenting existing 2D datasets with high-quality 3D pose fits. The resulting annotations are sufficient to train from scratch 3D pose regressor networks that outperform the current state-of-the-art on in-the-wild benchmarks.
arXiv Detail & Related papers (2020-04-07T20:21:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.