Related papers: VLMine: Long-Tail Data Mining with Vision Language Models

URL: http://arxiv.org/abs/2409.15486v1
Date: Mon, 23 Sep 2024 19:13:51 GMT
Title: VLMine: Long-Tail Data Mining with Vision Language Models
Authors: Mao Ye, Gregory P. Meyer, Zaiwei Zhang, Dennis Park, Siva Karthik Mustikovela, Yuning Chai, Eric M Wolff
Abstract summary: This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques.
Score: 18.412533708652102
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approach utilizes a VLM to summarize the content of an image into a set of keywords, and we identify rare examples based on keyword frequency. We find that the VLM offers a distinct signal for identifying long-tail examples when compared to conventional methods based on model uncertainty. Therefore, we propose a simple and general approach for integrating signals from multiple mining algorithms. We evaluate the proposed method on two diverse tasks: 2D image classification, in which inter-class variation is the primary source of data diversity, and 3D object detection, where intra-class variation is the main concern. Furthermore, through the detection task, we demonstrate that the knowledge extracted from 2D images is transferable to the 3D domain. Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques on several representative benchmarks: ImageNet-LT, Places-LT, and the Waymo Open Dataset.
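The keyword-frequency idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact scoring or fusion rule, so everything below is an assumption for illustration: the function names `keyword_rarity_scores` and `rank_fusion` are hypothetical, rarity is taken as the mean negative log document frequency of an image's keywords, signals are combined by averaging normalized ranks, and the keyword lists stand in for VLM output.

```python
from collections import Counter
import math


def keyword_rarity_scores(keywords_per_image):
    """Score each image by how rare its keywords are (higher = rarer).

    Assumed rule, not taken from the paper: mean negative log
    document frequency over the image's unique keywords.
    """
    n_images = len(keywords_per_image)
    # Document frequency: number of images in which each keyword appears.
    doc_freq = Counter()
    for kws in keywords_per_image.values():
        doc_freq.update(set(kws))

    scores = {}
    for image_id, kws in keywords_per_image.items():
        unique = set(kws)
        if not unique:
            scores[image_id] = 0.0
            continue
        total = sum(-math.log(doc_freq[k] / n_images) for k in unique)
        scores[image_id] = total / len(unique)
    return scores


def rank_fusion(*score_dicts):
    """Combine several mining signals by averaging normalized ranks.

    Assumed fusion rule for illustration; each dict maps image id to a
    score where higher means "more likely long-tail".
    """
    ids = list(score_dicts[0])
    denom = max(len(ids) - 1, 1)
    fused = {i: 0.0 for i in ids}
    for scores in score_dicts:
        ordered = sorted(ids, key=lambda i: scores[i])  # lowest score first
        for rank, i in enumerate(ordered):
            fused[i] += (rank / denom) / len(score_dicts)
    return fused


# Hypothetical keywords, standing in for what a VLM might return per image.
keywords = {
    "img_a": ["car", "road", "tree"],
    "img_b": ["car", "road"],
    "img_c": ["car", "road", "horse"],
    "img_d": ["car", "tree"],
}
rarity = keyword_rarity_scores(keywords)

# Hypothetical model-uncertainty scores from a conventional mining method.
uncertainty = {"img_a": 0.2, "img_b": 0.1, "img_c": 0.9, "img_d": 0.3}

fused = rank_fusion(rarity, uncertainty)
top = max(fused, key=fused.get)
print(top)
```

Rank averaging is used here only because it needs no calibration between signals on different scales; the paper's own integration method may differ.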

Related papers

IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection [70.02774285130238]
This paper explores the combination of rich text semantics with both image-level and pixel-level information from images. We propose IAD-GPT, a novel paradigm based on MLLMs for Industrial Anomaly Detection. Experiments on MVTec-AD and VisA datasets demonstrate our state-of-the-art performance.
arXiv Detail & Related papers (2025-10-16T02:48:05Z)
Multilinear subspace learning for person re-identification based fusion of high order tensor features [2.03240755905453]
PRe-ID aims to identify and track target individuals who have already been detected in a network of cameras. To this end, two powerful features, Convolutional Neural Networks (CNN) and Local Maximal Occurrence (LOMO), are modeled on multidimensional data. A new tensor fusion scheme is introduced to leverage and combine these two types of features in a single tensor.
arXiv Detail & Related papers (2025-05-09T23:39:27Z)
Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation [14.651682743504024]
Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. In our work, we explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data.
arXiv Detail & Related papers (2025-04-19T08:53:54Z)
A Recipe for Improving Remote Sensing VLM Zero Shot Generalization [0.4427533728730559]
We present two novel image-caption datasets for training of remote sensing foundation models. The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps. The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain.
arXiv Detail & Related papers (2025-03-10T21:09:02Z)
LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [32.57246173437492]
This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
A Multitask Deep Learning Model for Classification and Regression of Hyperspectral Images: Application to the large-scale dataset [44.94304541427113]
We propose a multitask deep learning model to perform multiple classification and regression tasks simultaneously on hyperspectral images. We validated our approach on a large hyperspectral dataset called TAIGA. A comprehensive qualitative and quantitative analysis of the results shows that the proposed method significantly outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-23T11:14:54Z)
Multimodal 3D Object Detection on Unseen Domains [37.142470149311904]
Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. We propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.
arXiv Detail & Related papers (2024-04-17T21:47:45Z)
Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version. We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection [59.41026558455904]
We focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets. We propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection.
arXiv Detail & Related papers (2024-01-06T07:30:41Z)
Generalized Few-Shot 3D Object Detection of LiDAR Point Cloud for Autonomous Driving [91.39625612027386]
We propose a novel task, called generalized few-shot 3D object detection, where we have a large amount of training data for common (base) objects, but only a few data for rare (novel) classes. Specifically, we analyze in-depth differences between images and point clouds, and then present a practical principle for the few-shot setting in the 3D LiDAR dataset. To solve this task, we propose an incremental fine-tuning method to extend existing 3D detection models to recognize both common and rare objects.
arXiv Detail & Related papers (2023-02-08T07:11:36Z)
SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method. We modernize the 3D convolutional backbone by introducing multi-head self-attention modules. In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE). M3AE learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition [24.406654146411682]
Vision Transformer (ViT) is the research base for this paper. Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images. We propose a weakly supervised object localization-based approach to extract multi-scale local features.
arXiv Detail & Related papers (2022-04-22T14:38:40Z)
Multi-Perspective Anomaly Detection [3.3511723893430476]
We build upon the deep support vector data description algorithm and address multi-perspective anomaly detection. We employ different augmentation techniques with a denoising process to deal with scarce one-class data. We evaluate our approach on the new dices dataset using images from two different perspectives and also benchmark on the standard MNIST dataset.
arXiv Detail & Related papers (2021-05-20T17:07:36Z)
Distribution Alignment: A Unified Framework for Long-tail Visual Recognition [52.36728157779307]
We propose a unified distribution alignment strategy for long-tail visual recognition. We then introduce a generalized re-weight method in the two-stage learning to balance the class prior. Our approach achieves the state-of-the-art results across all four recognition tasks with a simple and unified framework.
arXiv Detail & Related papers (2021-03-30T14:09:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.