Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection
- URL: http://arxiv.org/abs/2407.08931v1
- Date: Fri, 12 Jul 2024 02:34:11 GMT
- Title: Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection
- Authors: Xingyu Peng, Yan Bai, Chen Gao, Lirong Yang, Fei Xia, Beipeng Mu, Xiaofei Wang, Si Liu
- Abstract summary: Open-Vocabulary Detection (OVD) is the task of detecting all interesting objects in a given scene without predefined object classes.
We propose a Global-Local Collaborative Scheme (GLIS) for the lidar-based OVD task.
With the global-local information, a Large Language Model (LLM) is applied for chain-of-thought inference.
- Score: 44.92009038111696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-Vocabulary Detection (OVD) is the task of detecting all interesting objects in a given scene without predefined object classes. Extensive work has been done on OVD for 2D RGB images, but the exploration of 3D OVD is still limited. Intuitively, lidar point clouds provide 3D information at both the object level and the scene level, which helps generate trustworthy detection results. However, previous lidar-based OVD methods focus only on object-level features and ignore the importance of scene-level information. In this paper, we propose a Global-Local Collaborative Scheme (GLIS) for the lidar-based OVD task, which contains a local branch to generate object-level detection results and a global branch to obtain a scene-level global feature. With the global-local information, a Large Language Model (LLM) is applied for chain-of-thought inference, and the detection results can be refined accordingly. We further propose Reflected Pseudo Labels Generation (RPLG) to generate high-quality pseudo labels for supervision and Background-Aware Object Localization (BAOL) to select precise object proposals. Extensive experiments on ScanNetV2 and SUN RGB-D demonstrate the superiority of our methods. Code is released at https://github.com/GradiusTwinbee/GLIS.
Related papers
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and hierarchical labels.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z)
- Simple Image-level Classification Improves Open-vocabulary Object Detection [27.131298903486474]
Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained.
Recent OVOD methods focus on adapting image-level pre-trained vision-language models (VLMs), such as CLIP, to a region-level object detection task via, e.g., region-level knowledge distillation, regional prompt learning, or region-text pre-training.
These methods have demonstrated remarkable performance in recognizing regional visual concepts, but they are weak in exploiting the VLMs' powerful global scene understanding ability learned from the billion-scale
arXiv Detail & Related papers (2023-12-16T13:06:15Z)
- What Makes Good Open-Vocabulary Detector: A Disassembling Perspective [6.623703413255309]
Open-vocabulary detection (OVD) is a new object detection paradigm, aiming to localize and recognize unseen objects defined by an unbounded vocabulary.
Previous works mainly focus on the open vocabulary classification part, with less attention on the localization part.
We show in this work that improving localization as well as cross-modal classification complement each other, and compose a good OVD detector jointly.
arXiv Detail & Related papers (2023-09-01T03:03:50Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Learning Object-level Point Augmentor for Semi-supervised 3D Object Detection [85.170578641966]
We propose an object-level point augmentor (OPA) that performs local transformations for semi-supervised 3D object detection.
In this way, the resultant augmentor is derived to emphasize object instances rather than irrelevant backgrounds.
Experiments on the ScanNet and SUN RGB-D datasets show that the proposed OPA performs favorably against the state-of-the-art methods.
arXiv Detail & Related papers (2022-12-19T06:56:14Z)
- 3DLG-Detector: 3D Object Detection via Simultaneous Local-Global Feature Learning [15.995277437128452]
Capturing both local and global features of irregular point clouds is essential to 3D object detection (3OD).
This paper explores new modules to simultaneously learn local-global features of scene point clouds that serve 3OD positively.
We propose an effective 3OD network via simultaneous local-global feature learning (dubbed 3DLG-Detector).
arXiv Detail & Related papers (2022-08-31T12:23:40Z)
- The Devil is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all monocular 3D object detectors on the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z)
- MLCVNet: Multi-Level Context VoteNet for 3D Object Detection [51.45832752942529]
We propose Multi-Level Context VoteNet (MLCVNet) to recognize 3D objects correlatively, building on the state-of-the-art VoteNet.
We introduce three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels.
Our method is an effective way to promote detection accuracy, achieving new state-of-the-art detection performance on challenging 3D object detection datasets.
arXiv Detail & Related papers (2020-04-12T19:10:24Z)