V$^2$L: Leveraging Vision and Vision-language Models into Large-scale
Product Retrieval
- URL: http://arxiv.org/abs/2207.12994v1
- Date: Tue, 26 Jul 2022 15:53:55 GMT
- Title: V$^2$L: Leveraging Vision and Vision-language Models into Large-scale
Product Retrieval
- Authors: Wenhao Wang, Yifan Sun, Zongxin Yang, Yi Yang
- Abstract summary: This paper introduces our 1st-place solution in the eBay eProduct Visual Search Challenge (FGVC9).
We show that combining vision models and vision-language models brings particular benefits from their complementarity and is a key factor in our superiority.
- Score: 32.28772179053869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Product retrieval is of great importance in the e-commerce domain. This paper
introduces our 1st-place solution in the eBay eProduct Visual Search Challenge
(FGVC9), which features an ensemble of about 20 models drawn from vision
models and vision-language models. While model ensembling is common, we show that
combining vision models and vision-language models brings particular
benefits from their complementarity and is a key factor in our superiority.
Specifically, for the vision models we use a two-stage training pipeline that
first learns from the coarse labels provided in the training set and then
conducts fine-grained self-supervised training, yielding a coarse-to-fine
metric learning scheme. For the vision-language models, we use the textual
descriptions of the training images as the supervision signal for fine-tuning
the image encoder (feature extractor). With these designs, our solution
achieves 0.7623 MAR@10, ranking first among all competitors. The
code is available at https://github.com/WangWenhao0716/V2L.
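The key design described in the abstract is fusing complementary features from vision and vision-language models before nearest-neighbor retrieval, evaluated with MAR@10. Below is a minimal sketch of that general recipe in NumPy, using random arrays as stand-ins for the embeddings produced by the trained encoders; the helper names (fuse_features, retrieve_topk, mar_at_k) and the exact MAR@10 definition are illustrative assumptions, not the authors' released pipeline.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Normalize feature vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fuse_features(vision_feats, vl_feats):
    # Concatenate L2-normalized vision and vision-language features.
    return np.concatenate([l2_normalize(vision_feats), l2_normalize(vl_feats)], axis=1)

def retrieve_topk(query, index, k=10):
    # Cosine-similarity retrieval: return the top-k index ids for each query.
    sims = l2_normalize(query) @ l2_normalize(index).T   # (num_query, num_index)
    return np.argsort(-sims, axis=1)[:, :k]

def mar_at_k(topk_ids, query_labels, index_labels, k=10):
    # One common recall@k definition, averaged over queries;
    # the challenge's exact MAR@10 formula may differ.
    recalls = []
    for ids, q in zip(topk_ids, query_labels):
        relevant = np.where(index_labels == q)[0]
        if len(relevant) == 0:
            continue
        hits = len(set(ids.tolist()) & set(relevant.tolist()))
        recalls.append(hits / min(k, len(relevant)))
    return float(np.mean(recalls))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for embeddings from the vision / vision-language encoders.
    q_vis, q_vl = rng.normal(size=(100, 512)), rng.normal(size=(100, 512))
    i_vis, i_vl = rng.normal(size=(1000, 512)), rng.normal(size=(1000, 512))
    q_labels, i_labels = rng.integers(0, 50, 100), rng.integers(0, 50, 1000)

    topk = retrieve_topk(fuse_features(q_vis, q_vl), fuse_features(i_vis, i_vl), k=10)
    print("MAR@10 on random features:", mar_at_k(topk, q_labels, i_labels, k=10))
```

Concatenating the two L2-normalized feature blocks and re-normalizing makes the fused cosine similarity the average of the per-model cosine similarities, which is one simple way to realize the complementarity the abstract emphasizes.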
Related papers
- ViTamin: Designing Scalable Vision Models in the Vision-Language Era [26.878662961209997]
Vision Transformers (ViTs) remain the default choice for the image encoder.
ViTamin-L significantly outperforms ViT-L by 2.0% ImageNet zero-shot accuracy.
ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy.
arXiv Detail & Related papers (2024-04-02T17:40:29Z)
- When Do We Not Need Larger Vision Models? [55.957626371697785]
Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations.
We demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model can outperform larger models.
We release a Python package that can apply S$^2$ on any vision model with one line of code.
arXiv Detail & Related papers (2024-03-19T17:58:39Z)
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z)
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks [27.450456238980433]
We propose a new general foundation model, X-FM (the X-Foundation Model).
X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method.
X-FM can significantly outperform existing general foundation models and perform better than or comparable to existing foundation models specifically for language, vision, or vision-language understanding.
arXiv Detail & Related papers (2023-01-12T15:03:05Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks [38.05496300873095]
Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
We propose to learn multi-grained vision language alignments by a unified pre-training framework.
X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions.
arXiv Detail & Related papers (2022-11-22T16:48:01Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.