Exploring Visual Embedding Spaces Induced by Vision Transformers for Online Auto Parts Marketplaces
- URL: http://arxiv.org/abs/2502.05756v1
- Date: Sun, 09 Feb 2025 03:24:03 GMT
- Title: Exploring Visual Embedding Spaces Induced by Vision Transformers for Online Auto Parts Marketplaces
- Authors: Cameron Armijo, Pablo Rivas
- Abstract summary: This study examines the capabilities of the Vision Transformer model in generating visual embeddings for images of auto parts sourced from online marketplaces.
By focusing exclusively on single-modality data, the analysis evaluates ViT's potential for detecting patterns indicative of illicit activities.
- Abstract: This study examines the capabilities of the Vision Transformer (ViT) model in generating visual embeddings for images of auto parts sourced from online marketplaces, such as Craigslist and OfferUp. By focusing exclusively on single-modality data, the analysis evaluates ViT's potential for detecting patterns indicative of illicit activities. The workflow involves extracting high-dimensional embeddings from images, applying dimensionality reduction techniques like Uniform Manifold Approximation and Projection (UMAP) to visualize the embedding space, and using K-Means clustering to categorize similar items. Representative posts nearest to each cluster centroid provide insights into the composition and characteristics of the clusters. While the results highlight the strengths of ViT in isolating visual patterns, challenges such as overlapping clusters and outliers underscore the limitations of single-modal approaches in this domain. This work contributes to understanding the role of Vision Transformers in analyzing online marketplaces and offers a foundation for future advancements in detecting fraudulent or illegal activities.
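The workflow described above (extract embeddings, reduce dimensionality, cluster, then inspect the posts nearest each centroid) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the random matrix stands in for ViT embeddings of marketplace images, and PCA stands in for UMAP to keep the sketch dependency-light (with real data one would use a pretrained ViT and the `umap-learn` package).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder for ViT embeddings: one 768-dim vector per marketplace image.
embeddings = rng.normal(size=(200, 768))

# Dimensionality reduction for visualizing the embedding space
# (UMAP in the paper; PCA here as a lightweight stand-in).
coords_2d = PCA(n_components=2).fit_transform(embeddings)

# Categorize similar items with K-Means in the full embedding space.
k = 5
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# Representative post per cluster: the image nearest each centroid.
representatives = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
    representatives.append(int(members[dists.argmin()]))
```

Each entry of `representatives` indexes the image whose embedding lies closest to its cluster centroid, which is how the study surfaces posts that characterize each cluster.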
Related papers
- ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models [10.858627659431928]
Service robots must effectively recognize and segment unknown objects to enhance their functionality.
Traditional supervised learning-based segmentation techniques require extensive annotated datasets.
This paper proposes a novel approach (ZISVFM) for solving unseen object instance segmentation (UOIS) by leveraging the powerful zero-shot capability of the segment anything model (SAM) and explicit visual representations from a self-supervised vision transformer (ViT).
arXiv Detail & Related papers (2025-02-05T15:22:20Z) - LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z) - Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z) - DimVis: Interpreting Visual Clusters in Dimensionality Reduction With Explainable Boosting Machine [3.2748787252933442]
DimVis is a tool that employs supervised Explainable Boosting Machine (EBM) models as an interpretation assistant for DR projections.
Our tool facilitates high-dimensional data analysis by providing an interpretation of feature relevance in visual clusters.
arXiv Detail & Related papers (2024-02-10T04:50:36Z) - Vision Transformers Need Registers [26.63912173005165]
We identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks.
We show that appending additional register tokens to the input sequence fixes these artifacts entirely for both supervised and self-supervised models.
arXiv Detail & Related papers (2023-09-28T16:45:46Z) - Spatial Transform Decoupling for Oriented Object Detection [43.44237345360947]
Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks.
We present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs.
arXiv Detail & Related papers (2023-08-21T08:36:23Z) - Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance: 88.5% Top-1 accuracy on the ImageNet validation set and 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Transforming Feature Space to Interpret Machine Learning Models [91.62936410696409]
This contribution proposes a novel approach that interprets machine-learning models through the lens of feature space transformations.
It can be used to enhance unconditional as well as conditional post-hoc diagnostic tools.
A case study on remote-sensing landcover classification with 46 features is used to demonstrate the potential of the proposed approach.
arXiv Detail & Related papers (2021-04-09T10:48:11Z) - Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.