Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection
- URL: http://arxiv.org/abs/2502.01401v3
- Date: Thu, 20 Feb 2025 08:59:27 GMT
- Title: Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection
- Authors: Boyu Mi, Hanqing Wang, Tai Wang, Yilun Chen, Jiangmiao Pang
- Abstract summary: Evolvable Symbolic Visual Grounder (EaSe) is a training-free symbolic framework for 3D visual grounding. EaSe achieves 52.9% accuracy on the Nr3D dataset and 49.2% Acc@0.25 on ScanRefer. It substantially reduces inference time and cost, offering a balanced trade-off between performance and efficiency.
- Score: 25.520626014113585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D visual grounding (3DVG) is challenging because it requires understanding visual information, language, and spatial relationships. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high cost of 3D vision-language datasets. LLM/VLM-based agents, on the other hand, have been proposed for 3DVG and eliminate the need for training data, but they incur prohibitive time and token costs during inference. To address these challenges, we introduce Evolvable Symbolic Visual Grounder (EaSe), a novel training-free symbolic framework for 3D visual grounding that offers significantly reduced inference costs compared to previous agent-based methods while maintaining comparable performance. EaSe uses LLM-generated code to compute spatial relationships. It also implements an automatic pipeline to evaluate and optimize the quality of this code, and integrates VLMs to assist in the grounding process. Experimental results demonstrate that EaSe achieves 52.9% accuracy on the Nr3D dataset and 49.2% Acc@0.25 on ScanRefer, which is top-tier among training-free methods. Moreover, it substantially reduces inference time and cost, offering a balanced trade-off between performance and efficiency. Code is available at https://github.com/OpenRobotLab/EaSe.
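As a rough illustration of the kind of LLM-generated spatial-relation code the abstract refers to, below is a minimal sketch, assuming axis-aligned 3D bounding boxes and a hypothetical "above" heuristic; the function names and scoring rule are illustrative assumptions, not EaSe's actual generated code, and the automatic evaluation/optimization loop and VLM assistance are not shown.

```python
# Hypothetical sketch of an LLM-generated spatial relation used by a symbolic
# grounder: score how well each candidate satisfies "X above Y" from
# axis-aligned 3D bounding boxes. Heuristics and names are illustrative only.
import numpy as np

def above_score(target_box: np.ndarray, anchor_box: np.ndarray) -> float:
    """target_box/anchor_box: (2, 3) arrays of (min_xyz, max_xyz)."""
    t_center = target_box.mean(axis=0)
    a_center = anchor_box.mean(axis=0)
    # Vertical gap between the target's bottom and the anchor's top (z is up).
    vertical_gap = target_box[0, 2] - anchor_box[1, 2]
    # Horizontal closeness encourages the target to sit over the anchor.
    horizontal_dist = np.linalg.norm(t_center[:2] - a_center[:2])
    return float(vertical_gap > 0) / (1.0 + horizontal_dist)

def ground(boxes: dict, relation, anchor_name: str) -> str:
    """Pick the candidate that best satisfies relation(candidate, anchor)."""
    anchor = boxes[anchor_name]
    scores = {name: relation(box, anchor)
              for name, box in boxes.items() if name != anchor_name}
    return max(scores, key=scores.get)

boxes = {
    "table": np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.8]]),
    "lamp":  np.array([[0.3, 0.3, 0.85], [0.5, 0.5, 1.2]]),
    "chair": np.array([[1.5, 0.0, 0.0], [2.0, 0.5, 0.9]]),
}
print(ground(boxes, above_score, "table"))  # -> "lamp"
```

Once such relation functions exist, grounding an utterance reduces to cheap symbolic scoring over object boxes instead of repeated LLM calls at inference time, which is the efficiency argument the abstract makes.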
Related papers
- Learn 3D VQA Better with Active Selection and Reannotation [46.687613392366174]
In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset.
We propose a multi-turn interactive active learning strategy that selects data based on models' semantic uncertainty to form a solid knowledge foundation.
Experiments show better model performance and a substantial reduction in training costs, roughly halving the cost of reaching relatively high accuracy.
arXiv Detail & Related papers (2025-07-07T03:18:54Z)
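As a hedged sketch of the uncertainty-driven selection described in the entry above, the snippet below ranks samples by predictive entropy and picks a fixed budget for reannotation; the entropy criterion and all names are assumptions, not the paper's exact strategy.

```python
# Hedged sketch: pick the samples a 3D VQA model is most uncertain about
# (highest predictive entropy) as candidates for the next annotation round.
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """probs: (num_samples, num_answers) softmax outputs."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def select_for_reannotation(probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` most uncertain samples."""
    return np.argsort(-predictive_entropy(probs))[:budget]

rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 20))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(select_for_reannotation(probs, budget=5))
```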
- DC-Scene: Data-Centric Learning for 3D Scene Understanding [11.204526527127094]
3D scene understanding plays a fundamental role in vision applications such as robotics, autonomous driving, and augmented reality.
We propose DC-Scene, a data-centric framework tailored for 3D scene understanding.
We introduce a CLIP-driven dual-indicator quality (DIQ) filter, combining vision-language alignment scores with caption-loss perplexity, along with a curriculum scheduler.
arXiv Detail & Related papers (2025-05-21T08:05:27Z)
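A minimal sketch of how a dual-indicator quality filter like the DIQ filter described above might work, assuming precomputed CLIP-style alignment scores and caption-loss perplexities; the thresholds, difficulty measure, and names are illustrative assumptions rather than DC-Scene's actual formulation.

```python
# Hedged sketch of a dual-indicator quality filter: keep samples whose
# vision-language alignment is high and whose caption perplexity is low,
# then order the survivors easy-to-hard for a curriculum. Synthetic values.
import numpy as np

def diq_filter(align: np.ndarray, ppl: np.ndarray,
               align_min: float = 0.25, ppl_max: float = 40.0) -> np.ndarray:
    """Return indices of samples passing both quality indicators."""
    keep = (align >= align_min) & (ppl <= ppl_max)
    return np.flatnonzero(keep)

def curriculum_order(align: np.ndarray, ppl: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Sort kept samples from 'easiest' (well-aligned, low perplexity) to hardest."""
    difficulty = (1.0 - align[idx]) + ppl[idx] / ppl[idx].max()
    return idx[np.argsort(difficulty)]

rng = np.random.default_rng(0)
align = rng.uniform(0.0, 0.5, size=1000)   # stand-in alignment scores
ppl = rng.uniform(5.0, 80.0, size=1000)    # stand-in caption-loss perplexity
kept = diq_filter(align, ppl)
schedule = curriculum_order(align, ppl, kept)
print(len(kept), schedule[:10])
```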
- From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs [64.28181017898369]
LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views.
LIFT-GS achieves state-of-the-art results with 25.7% mAP on open-vocabulary instance segmentation.
Remarkably, pretraining effectively multiplies fine-tuning datasets by 2x, demonstrating strong scaling properties.
arXiv Detail & Related papers (2025-02-27T18:59:11Z)
- Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance [8.07701188057789]
We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data.
Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues.
Our method achieves up to 85% of the fully-supervised performance using only 10% labeled data.
arXiv Detail & Related papers (2024-08-21T12:13:18Z)
- P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders [34.64343313442465]
Pre-training in 3D is pivotal for advancing 3D perception tasks.
However, the scarcity of clean 3D data poses significant challenges for scaling 3D pre-training efforts.
We introduce an innovative self-supervised pre-training framework.
Our method achieves state-of-the-art performance in 3D classification, detection, and few-shot learning.
arXiv Detail & Related papers (2024-08-19T13:59:53Z)
- Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization [51.33923845954759]
3D Visual Grounding (3DVG) and 3D Captioning (3DDC) are two crucial tasks in various 3D applications.
We propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks.
In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection.
arXiv Detail & Related papers (2024-04-17T04:46:27Z)
- CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding [23.885017062031217]
3D visual grounding is the ability to localize objects in 3D scenes conditioned on utterances.
Most existing methods devote the referring head to localizing the referred object directly, which causes failures in complex scenarios.
We formulate the 3D visual grounding problem as a sequence-to-sequence (Seq2Seq) task by first predicting a chain of anchors and then the final target.
arXiv Detail & Related papers (2023-10-10T00:07:25Z)
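To make the chain-of-anchors formulation above concrete, here is a toy sketch in which a decoder stub emits a sequence of distinct object indices and the last one is taken as the grounded target; the greedy, no-repeat decoding and all names are assumptions, not CoT3DRef's architecture.

```python
# Hedged sketch: a referring decoder emits object indices interpreted as
# [anchor_1, ..., anchor_k, target]; the last index is the grounded object.
# The per-step scores below come from a random stub, not a trained model.
import numpy as np

def decode_chain(object_scores: np.ndarray, chain_length: int) -> list:
    """object_scores: (steps, num_objects) per-step scores from a decoder stub."""
    chain, used = [], set()
    for step in range(chain_length):
        order = np.argsort(-object_scores[step])
        next_obj = next(i for i in order if i not in used)  # forbid repeats
        chain.append(int(next_obj))
        used.add(int(next_obj))
    return chain

rng = np.random.default_rng(0)
scores = rng.random((3, 8))            # 2 anchors + 1 target over 8 objects
chain = decode_chain(scores, chain_length=3)
anchors, target = chain[:-1], chain[-1]
print("anchors:", anchors, "target:", target)
```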
- Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection [90.32180043449263]
State-of-the-art 3D object detectors are usually trained on large-scale datasets with high-quality 3D annotations.
A natural remedy is to adopt semi-supervised learning (SSL) by leveraging a limited amount of labeled samples and abundant unlabeled samples.
This paper introduces Hierarchical Supervision and Shuffle Data Augmentation (HSSDA), a simple yet effective teacher-student framework.
arXiv Detail & Related papers (2023-04-04T02:09:32Z)
- A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success at the price of large-scale labeled training data.
The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist.
To circumvent these issues, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
- Learning from Unlabeled 3D Environments for Vision-and-Language Navigation [87.03299519917019]
In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions.
We propose to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D.
We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of resulting VLN models.
arXiv Detail & Related papers (2022-08-24T21:50:20Z)
- Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as that of the labeled ones.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data are prioritized.
arXiv Detail & Related papers (2022-05-02T16:09:17Z)
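A minimal sketch of the sample-weighting idea above, assuming a soft weight derived from prediction confidence that downweights likely out-of-distribution samples in the unsupervised loss; the sigmoid weighting and its hyperparameters are illustrative assumptions, not the paper's design.

```python
# Hedged sketch: weight each unlabeled sample by how "in-distribution" it looks
# (here via prediction confidence), so out-of-distribution samples contribute
# little to the unsupervised loss. Illustrative only.
import numpy as np

def sample_weights(max_probs: np.ndarray, tau: float = 0.7, sharpness: float = 10.0) -> np.ndarray:
    """Soft weights in [0, 1]; confident (likely in-distribution) samples get ~1."""
    return 1.0 / (1.0 + np.exp(-sharpness * (max_probs - tau)))

def weighted_unsup_loss(per_sample_loss: np.ndarray, max_probs: np.ndarray) -> float:
    w = sample_weights(max_probs)
    return float((w * per_sample_loss).sum() / (w.sum() + 1e-8))

rng = np.random.default_rng(0)
loss = rng.uniform(0.0, 2.0, size=256)   # stand-in per-sample SSL losses
conf = rng.uniform(0.3, 1.0, size=256)   # stand-in max predicted probabilities
print(weighted_unsup_loss(loss, conf))
```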
- Unsupervised Learning of slow features for Data Efficient Regression [15.73372211126635]
We propose the slow variational autoencoder (S-VAE), an extension to the $\beta$-VAE which applies a temporal similarity constraint to the latent representations.
We evaluate the three methods against their data-efficiency on downstream tasks using a synthetic 2D ball tracking dataset, a dataset from a reinforcement learning environment and a dataset generated using the DeepMind Lab environment.
arXiv Detail & Related papers (2020-12-11T12:19:45Z)
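A small sketch of the temporal-similarity constraint described above, added to a $\beta$-VAE-style objective as a slowness penalty on consecutive latent codes; the loss weights and stand-in inputs are assumptions, not the paper's S-VAE implementation.

```python
# Hedged sketch of a slow-VAE-style objective: a beta-VAE loss plus a
# temporal-similarity term pulling latents of consecutive frames together.
# All terms are computed on stand-in values; weights are illustrative.
import numpy as np

def slow_vae_loss(recon_err: float, kl: float, z_seq: np.ndarray,
                  beta: float = 4.0, lam: float = 1.0) -> float:
    """z_seq: (T, latent_dim) latent codes of consecutive frames."""
    slowness = np.mean(np.sum((z_seq[1:] - z_seq[:-1]) ** 2, axis=1))
    return recon_err + beta * kl + lam * slowness

rng = np.random.default_rng(0)
z_seq = rng.normal(size=(16, 8))   # latents for a 16-frame clip
print(slow_vae_loss(recon_err=0.9, kl=0.2, z_seq=z_seq))
```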
- SelfVoxeLO: Self-supervised LiDAR Odometry with Voxel-based Deep Neural Networks [81.64530401885476]
We propose a self-supervised LiDAR odometry method, dubbed SelfVoxeLO, to tackle these two difficulties.
Specifically, we propose a 3D convolution network to process the raw LiDAR data directly, which extracts features that better encode the 3D geometric patterns.
We evaluate our method's performance on two large-scale datasets, i.e., KITTI and Apollo-SouthBay.
arXiv Detail & Related papers (2020-10-19T09:23:39Z)
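As a hedged illustration of processing raw LiDAR with 3D convolutions, the sketch below voxelizes a point cloud into an occupancy grid and runs a tiny 3D CNN over it; the grid resolution, ranges, and network are assumptions, not SelfVoxeLO's actual architecture.

```python
# Hedged sketch: voxelize a raw point cloud into an occupancy grid and apply a
# small 3D convolutional network, mirroring the "3D conv on raw LiDAR" idea.
import numpy as np
import torch
import torch.nn as nn

def voxelize(points, grid=(64, 64, 16), bounds=((-40, 40), (-40, 40), (-3, 3))):
    """points: (N, 3) xyz. Returns a (1, 1, X, Y, Z) occupancy tensor."""
    occ = np.zeros(grid, dtype=np.float32)
    for dim in range(3):                       # drop points outside the range
        lo, hi = bounds[dim]
        points = points[(points[:, dim] >= lo) & (points[:, dim] < hi)]
    idx = np.empty((points.shape[0], 3), dtype=np.int64)
    for dim in range(3):                       # map coordinates to voxel indices
        lo, hi = bounds[dim]
        idx[:, dim] = ((points[:, dim] - lo) / (hi - lo) * grid[dim]).astype(np.int64)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return torch.from_numpy(occ)[None, None]   # (batch, channel, X, Y, Z)

net = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU())
points = np.random.default_rng(0).uniform(-40, 40, size=(2048, 3)).astype(np.float32)
points[:, 2] = np.clip(points[:, 2], -3.0, 2.9)   # keep z in range for the demo
features = net(voxelize(points))
print(features.shape)   # e.g. torch.Size([1, 16, 32, 32, 8])
```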
- Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation [107.07047303858664]
Large-scale human datasets with 3D ground-truth annotations are difficult to obtain in the wild.
We address this problem by augmenting existing 2D datasets with high-quality 3D pose fits.
The resulting annotations are sufficient to train from scratch 3D pose regressor networks that outperform the current state-of-the-art on in-the-wild benchmarks.
arXiv Detail & Related papers (2020-04-07T20:21:18Z)
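The entry above describes augmenting 2D datasets with 3D pose fits; below is a heavily simplified, hypothetical sketch of such a fitting step, optimizing 3D joints so that their orthographic projection matches 2D keypoints while respecting template bone lengths. It is not the paper's exemplar fine-tuning procedure.

```python
# Hedged sketch: recover a plausible 3D joint configuration for an image that
# only has 2D keypoints, by minimizing reprojection error under an orthographic
# camera plus a bone-length prior. A stand-in for "2D datasets + 3D pose fits".
import torch

def fit_3d_to_2d(kp2d, bones, bone_len, steps=300, lr=0.05):
    """kp2d: (J, 2) keypoints; bones: (B, 2) joint index pairs; bone_len: (B,)."""
    joints3d = (0.01 * torch.randn(kp2d.shape[0], 3)).requires_grad_()
    opt = torch.optim.Adam([joints3d], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        reproj_loss = ((joints3d[:, :2] - kp2d) ** 2).mean()   # drop z to project
        seg = joints3d[bones[:, 0]] - joints3d[bones[:, 1]]
        bone_loss = ((seg.norm(dim=1) - bone_len) ** 2).mean() # keep template lengths
        loss = reproj_loss + 0.1 * bone_loss
        loss.backward()
        opt.step()
    return joints3d.detach()

bones = torch.tensor([[0, 1], [1, 2], [2, 3]])                 # toy 4-joint chain
bone_len = torch.tensor([0.5, 0.5, 0.5])
kp2d = torch.tensor([[0.0, 0.0], [0.3, 0.0], [0.6, 0.0], [0.9, 0.0]])
pseudo_3d = fit_3d_to_2d(kp2d, bones, bone_len)                # pseudo 3D annotation
print(pseudo_3d)
```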