Leveraging VLM-Based Pipelines to Annotate 3D Objects
- URL: http://arxiv.org/abs/2311.17851v2
- Date: Mon, 17 Jun 2024 17:27:19 GMT
- Title: Leveraging VLM-Based Pipelines to Annotate 3D Objects
- Authors: Rishabh Kabra, Loic Matthey, Alexander Lerchner, Niloy J. Mitra
- Abstract summary: We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response.
Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods.
We show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
- Score: 68.51034848207355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained vision language models (VLMs) present an opportunity to caption unlabeled 3D objects at scale. The leading approach to summarize VLM descriptions from different views of an object (Luo et al., 2023) relies on a language model (GPT4) to produce the final output. This text-based aggregation is susceptible to hallucinations as it merges potentially contradictory descriptions. We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response. Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods. We show our probabilistic aggregation is not only more reliable and efficient, but sets the SoTA on inferring object types with respect to human-verified labels. The aggregated annotations are also useful for conditional inference; they improve downstream predictions (e.g., of object material) when the object's type is specified as an auxiliary text-based input. Such auxiliary inputs allow ablating the contribution of visual reasoning over visionless reasoning in an unsupervised setting. With these supervised and unsupervised evaluations, we show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
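The core idea is concrete enough to sketch: instead of asking a language model to merge per-view captions, each candidate annotation is scored by the VLM's image-text likelihood on every rendered view, and the viewpoint is marginalized out. The snippet below is a minimal illustration of that aggregation, assuming a hypothetical `vlm_log_likelihood(image, text)` callable and a uniform prior over views; it is a sketch of the general technique, not the paper's exact formulation.

```python
import numpy as np

def aggregate_annotations(views, candidate_labels, vlm_log_likelihood):
    """Score candidate text annotations for one object by marginalizing
    the VLM's image-text log-likelihoods over rendered viewpoints."""
    scores = {}
    for label in candidate_labels:
        # Per-view joint log-likelihoods log p(label, view_v).
        per_view = np.array([vlm_log_likelihood(img, label) for img in views])
        # Uniform prior over views: p(label) = (1/V) * sum_v p(label, view_v),
        # computed stably in log space.
        scores[label] = np.logaddexp.reduce(per_view) - np.log(len(views))
    # The highest-scoring label is the aggregated annotation.
    return max(scores, key=scores.get), scores
```

Conditional inference as described in the abstract would amount to appending an already-inferred attribute (e.g., the object type) to each candidate text before scoring.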
Related papers
- Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks [41.488394198111976]
Vision language models (VLMs) like CLIP show stellar zero-shot capability on classification benchmarks.
However, selecting the VLM with the highest performance on the unlabeled downstream task is non-trivial.
This paper introduces the problem of unsupervised vision-language model selection, where only unsupervised downstream datasets are available.
arXiv Detail & Related papers (2024-12-30T03:26:53Z) - LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a visual-language task that segments all points of the object specified by a query sentence from a 3D point cloud.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z) - BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
arXiv Detail & Related papers (2024-07-18T12:11:12Z) - VAPO: Visibility-Aware Keypoint Localization for Efficient 6DoF Object Pose Estimation [52.81869878956534]
Localizing 3D keypoints in a 2D image is an effective way to establish 3D-2D correspondences for 6DoF object pose estimation.
In this paper, we address the fact that keypoints may be occluded or out of view by localizing the important keypoints in terms of visibility.
We construct VAPO (Visibility-Aware POse estimator) by integrating the visibility-aware importance with a state-of-the-art pose estimation algorithm.
arXiv Detail & Related papers (2024-03-21T16:59:45Z) - 3VL: Using Trees to Improve Vision-Language Models' Interpretability [40.678288227161936]
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks.
These representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects.
In this work, we introduce the architecture and training technique of Tree-augmented Vision-Language (3VL) model accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool.
arXiv Detail & Related papers (2023-12-28T20:26:03Z) - GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language
Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses visual grounding ability from existing models trained from image-text pairs and pure object detection data.
We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Towards Addressing the Misalignment of Object Proposal Evaluation for
Vision-Language Tasks via Semantic Grounding [36.03994217853856]
The performance of object proposals generated for Vision-Language (VL) tasks is currently evaluated across all available annotations.
Our work serves as a study of this phenomenon and explores the effectiveness of semantic grounding to mitigate its effects.
We show that our method is consistent and demonstrates greatly improved alignment with annotations selected by image captioning metrics and by human annotators.
arXiv Detail & Related papers (2023-09-01T02:19:41Z) - Reflection Invariance Learning for Few-shot Semantic Segmentation [53.20466630330429]
Few-shot semantic segmentation (FSS) aims to segment objects of unseen classes in query images with only a few annotated support images.
This paper proposes a fresh few-shot segmentation framework to mine the reflection invariance in a multi-view matching manner.
Experiments on both PASCAL-$5^i$ and COCO-$20^i$ datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-01T15:14:58Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.