Efficient Large-Scale Visual Representation Learning And Evaluation
- URL: http://arxiv.org/abs/2305.13399v5
- Date: Tue, 1 Aug 2023 21:01:04 GMT
- Title: Efficient Large-Scale Visual Representation Learning And Evaluation
- Authors: Eden Dolev, Alaa Awad, Denisa Roberts, Zahra Ebrahimzadeh, Marcin
Mejran, Vaibhav Malpani, and Mahir Yavuz
- Abstract summary: We describe challenges in e-commerce vision applications at scale and highlight methods to efficiently train, evaluate, and serve visual representations.
We present ablation studies evaluating visual representations in several downstream tasks.
We include online results from deployed machine learning systems in production on a large-scale e-commerce platform.
- Score: 0.13192560874022083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficiently learning visual representations of items is vital for large-scale
recommendations. In this article we compare several pretrained efficient
backbone architectures, both in the convolutional neural network (CNN) and in
the vision transformer (ViT) family. We describe challenges in e-commerce
vision applications at scale and highlight methods to efficiently train,
evaluate, and serve visual representations. We present ablation studies
evaluating visual representations in several downstream tasks. To this end, we
present a novel multilingual text-to-image generative offline evaluation method
for visually similar recommendation systems. Finally, we include online results
from deployed machine learning systems in production on a large-scale
e-commerce platform.
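The backbone comparison described in the abstract can be illustrated with a short, hedged sketch: the model names, the `timm` loading route, and the L2 normalization below are illustrative assumptions, not the authors' production pipeline.

```python
# Minimal sketch (not the authors' pipeline): extracting item embeddings from
# two pretrained efficient backbones, one CNN and one ViT, via the timm library.
# The specific model names and pooling choices here are illustrative assumptions.
import timm
import torch

# num_classes=0 removes the classifier head so forward() returns pooled features.
cnn_backbone = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)
vit_backbone = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=0)

images = torch.randn(4, 3, 224, 224)  # stand-in for a batch of product images

with torch.no_grad():
    cnn_embeddings = cnn_backbone(images)   # (4, 1280) for efficientnet_b0
    vit_embeddings = vit_backbone(images)   # (4, 384) for vit_small_patch16_224

# L2-normalized embeddings are a common choice for nearest-neighbour retrieval
# in visually similar recommendation; this normalization is an assumption here.
cnn_embeddings = torch.nn.functional.normalize(cnn_embeddings, dim=-1)
vit_embeddings = torch.nn.functional.normalize(vit_embeddings, dim=-1)
```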
Related papers
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features complement the cross-modal features well, improving few-shot image classification.
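A rough sketch of how the three loss terms named in this summary could be combined; the temperatures, weights, and distillation target below are assumptions for illustration, not the paper's formulation.

```python
# Rough sketch of combining the three SgVA-CLIP-style loss terms. This is NOT
# the paper's exact formulation; temperatures, weights, and the distillation
# target are assumptions for illustration only.
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    """Standard InfoNCE between two batches of L2-normalized embeddings."""
    logits = anchors @ positives.t() / temperature
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)

def sgva_style_loss(adapted_visual, augmented_visual, text_features,
                    frozen_clip_visual, w_vis=1.0, w_cross=1.0, w_kd=1.0):
    # Vision-specific contrastive term: two views of the same image agree.
    l_vis = info_nce(adapted_visual, augmented_visual)
    # Cross-modal contrastive term: adapted visual features align with text.
    l_cross = info_nce(adapted_visual, text_features)
    # Distillation term (assumed form): stay close to the frozen CLIP encoder.
    l_kd = F.mse_loss(adapted_visual, frozen_clip_visual)
    return w_vis * l_vis + w_cross * l_cross + w_kd * l_kd
```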
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce [9.46186546774799]
We propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images.
We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges.
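A minimal sketch of the CLIP-style symmetric contrastive alignment this summary describes, assuming pre-computed image and text embeddings; the temperature and other details are illustrative, not the paper's implementation.

```python
# Minimal sketch of contrastive alignment between product images and their raw
# text, in the spirit of the e-CLIP summary above; temperature, projection
# heads, and batching are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: match each image to its text and each text to its image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```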
arXiv Detail & Related papers (2022-07-01T05:16:47Z)
- Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition.
We propose incorporating peripheral position encoding into the multi-head self-attention layers so that, given training data, the network learns to partition the visual field into diverse peripheral regions.
We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
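A simplified stand-in for the idea of biasing multi-head self-attention with learned position information so that heads can specialize to different regions of the visual field; the full bias table below is an assumption and much simpler than the paper's peripheral position encoding.

```python
# Simplified stand-in for position-biased self-attention. The actual PerViT
# peripheral position encoding is more elaborate; this only illustrates the
# general mechanism of adding a learned, position-dependent bias per head.
import torch
import torch.nn as nn

class PositionBiasedAttention(nn.Module):
    def __init__(self, dim, num_heads, num_tokens):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per head over all token pairs (assumption: a full
        # table rather than a function of 2-D distance as in the paper).
        self.pos_bias = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))

    def forward(self, x):                     # x: (B, N, dim)
        B, N, dim = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn + self.pos_bias           # inject position-dependent bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, dim)
        return self.proj(out)
```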
arXiv Detail & Related papers (2022-06-14T12:47:47Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
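An illustrative sketch of the knowledge-augmentation idea: enriching a class name with an external gloss before it reaches the text encoder. Only WordNet via NLTK is used here, and the prompt template and first-synset choice are assumptions rather than K-LITE's actual procedure.

```python
# Illustrative sketch only: enrich a class name with a WordNet gloss before it
# is fed to the text encoder. The prompt template and the choice of the first
# synset are assumptions, not K-LITE's actual procedure.
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def knowledge_augmented_prompt(class_name: str) -> str:
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if not synsets:
        return f"a photo of a {class_name}."
    definition = synsets[0].definition()  # first WordNet gloss for the entity
    return f"a photo of a {class_name}, which is {definition}."

prompt = knowledge_augmented_prompt("golf cart")
# -> "a photo of a golf cart, which is <first WordNet gloss of 'golf cart'>."
```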
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations [9.6221436745451]
We describe how we generate a dataset with over a billion images via large weakly-supervised pretraining.
We leverage Transformers to replace the traditional convolutional backbone.
We show that large-scale Transformer-based pretraining provides significant benefits to industry computer vision applications.
arXiv Detail & Related papers (2021-08-12T17:58:56Z)
- Efficient automated U-Net based tree crown delineation using UAV multi-spectral imagery on embedded devices [2.7393821783237184]
Delineation approaches provide significant benefits to various domains, including agriculture and environmental and natural-disaster monitoring.
Deep learning has transformed computer vision and dramatically improved machine translation, though it requires massive datasets for training and substantial resources for inference.
We propose a U-Net based tree delineation method, which is effectively trained using multi-spectral imagery but can then delineate single-spectrum images.
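A generic, minimal U-Net-style encoder-decoder sketch for delineation masks; the layer sizes are illustrative, and the paper's handling of multi-spectral training versus single-spectrum inference is not reproduced here.

```python
# Minimal U-Net-style encoder-decoder for binary delineation masks. Generic
# sketch, not the paper's architecture; how the authors train on multi-spectral
# bands yet delineate single-spectrum images is not reproduced.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_channels=1, base=16):
        super().__init__()
        self.enc1 = conv_block(in_channels, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)          # after skip-connection concat
        self.head = nn.Conv2d(base, 1, 1)               # 1-channel crown mask

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d1)                            # logits per pixel

model = TinyUNet(in_channels=1)                         # single-spectrum input
mask_logits = model(torch.randn(1, 1, 128, 128))        # (1, 1, 128, 128)
```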
arXiv Detail & Related papers (2021-07-16T11:17:36Z)
- Online Bag-of-Visual-Words Generation for Unsupervised Representation Learning [59.29452780994169]
We propose a teacher-student scheme to learn representations by training a convnet to reconstruct a bag-of-visual-words (BoW) representation of an image.
Our strategy performs online training of both the teacher network (whose role is to generate the BoW targets) and the student network (whose role is to learn representations), along with an online update of the visual-words vocabulary.
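A condensed sketch of the teacher-student BoW reconstruction idea: the teacher's local features are soft-assigned to a visual-word vocabulary to form a target histogram that the student must predict. The assignment temperature and normalization choices are assumptions, not the paper's exact procedure.

```python
# Condensed sketch of BoW reconstruction as a training target. Vocabulary
# construction, the assignment temperature, and any momentum updates are
# simplified assumptions, not the paper's exact procedure.
import torch
import torch.nn.functional as F

def bow_target(teacher_feats, vocabulary, tau=0.1):
    """teacher_feats: (B, C, H, W) local features; vocabulary: (K, C) visual words."""
    B, C, H, W = teacher_feats.shape
    feats = F.normalize(teacher_feats.permute(0, 2, 3, 1).reshape(B, H * W, C), dim=-1)
    words = F.normalize(vocabulary, dim=-1)
    assign = F.softmax(feats @ words.t() / tau, dim=-1)   # soft-assign to visual words
    bow = assign.sum(dim=1)                               # (B, K) soft histogram
    return bow / bow.sum(dim=-1, keepdim=True)            # normalize to a distribution

def bow_loss(student_logits, teacher_feats, vocabulary):
    """student_logits: (B, K) predicted BoW distribution from a perturbed view."""
    with torch.no_grad():
        target = bow_target(teacher_feats, vocabulary)
    return -(target * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
```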
arXiv Detail & Related papers (2020-12-21T18:31:21Z)
- Visual Interest Prediction with Attentive Multi-Task Transfer Learning [6.177155931162925]
We propose a neural network model based on transfer learning and an attention mechanism to predict visual interest and affective dimensions in digital photos.
Evaluation of our model on the benchmark dataset shows a large improvement over current state-of-the-art systems.
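A generic sketch of a transfer-learning model with attention pooling and multi-task heads in the spirit of this summary; the backbone features, pooling, and output dimensions are assumptions, not the paper's architecture.

```python
# Generic multi-task sketch over pretrained (transferred) features; the feature
# dimension, pooling, and head sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class MultiTaskAttentionModel(nn.Module):
    def __init__(self, feat_dim=512, n_affective_dims=3):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)                # score per spatial location
        self.interest_head = nn.Linear(feat_dim, 1)       # visual interest
        self.affect_head = nn.Linear(feat_dim, n_affective_dims)

    def forward(self, feats):                             # feats: (B, N, feat_dim)
        weights = torch.softmax(self.attn(feats), dim=1)  # (B, N, 1)
        pooled = (weights * feats).sum(dim=1)             # attention-weighted pooling
        return self.interest_head(pooled), self.affect_head(pooled)
```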
arXiv Detail & Related papers (2020-05-26T14:49:34Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method that makes full use of an external language model (ELM) to integrate abundant linguistic knowledge into the captioning model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
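A schematic sketch of the two ideas summarised in the entry above: a relational layer over detected-object features and a TRL-style loss mixing ground-truth supervision with soft targets from an external language model; layer sizes, the mixing weight, and the use of plain KL distillation are assumptions.

```python
# Schematic sketch only: a relational update over object features plus a loss
# that mixes ground-truth cross-entropy with soft targets from an external
# language model (ELM). Sizes, weights, and the KL form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectRelationLayer(nn.Module):
    """Refines each object feature using learned relations to the other objects."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, objects):                          # objects: (B, N, dim)
        rel = torch.softmax(self.q(objects) @ self.k(objects).transpose(1, 2)
                            / objects.size(-1) ** 0.5, dim=-1)
        return objects + rel @ self.v(objects)           # residual relational update

def trl_loss(caption_logits, gt_words, elm_logits, alpha=0.5, temperature=2.0):
    """Mix the usual cross-entropy with soft targets from the external LM."""
    ce = F.cross_entropy(caption_logits, gt_words)
    kd = F.kl_div(F.log_softmax(caption_logits / temperature, dim=-1),
                  F.softmax(elm_logits / temperature, dim=-1),
                  reduction="batchmean")
    return (1 - alpha) * ce + alpha * kd
```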