CommerceMM: Large-Scale Commerce MultiModal Representation Learning with
Omni Retrieval
- URL: http://arxiv.org/abs/2202.07247v1
- Date: Tue, 15 Feb 2022 08:23:59 GMT
- Title: CommerceMM: Large-Scale Commerce MultiModal Representation Learning with
Omni Retrieval
- Authors: Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen,
Tamara L. Berg, Ning Zhang
- Abstract summary: CommerceMM is a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a piece of content.
We propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training.
Our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning.
- Score: 30.607369837039904
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We introduce CommerceMM, a multimodal model capable of providing a diverse
and granular understanding of commerce topics associated with a given piece of
content (image, text, or image+text), and able to generalize to a wide range of
tasks, including Multimodal Categorization, Image-Text Retrieval,
Query-to-Product Retrieval, and Image-to-Product Retrieval. We follow the
pre-training + fine-tuning regime and present 5 effective pre-training
tasks on image-text pairs. To embrace more common and diverse commerce data
with text-to-multimodal, image-to-multimodal, and multimodal-to-multimodal
mapping, we propose another 9 novel cross-modal and cross-pair retrieval tasks,
called Omni-Retrieval pre-training. The pre-training is conducted in an
efficient manner with only two forward/backward updates for the combined 14
tasks. Extensive experiments and analysis show the effectiveness of each task.
When combining all pre-training tasks, our model achieves state-of-the-art
performance on 7 commerce-related downstream tasks after fine-tuning.
Additionally, we propose a novel approach of modality randomization to
dynamically adjust our model under different efficiency constraints.
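To make the multi-task recipe above concrete, here is a minimal sketch (in PyTorch) of how the 14 pre-training losses might be grouped into two forward/backward updates per batch, with modality randomization approximated as randomly dropping one input modality per step. Everything here is a hypothetical illustration: the module and loss names (CommerceMMSketch, image_text_losses, omni_retrieval_losses), the feature sizes, and the exact form of the randomization are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-update multi-task pre-training step.
# Group A stands in for the 5 image-text pre-training losses; Group B for the
# 9 Omni-Retrieval losses. Names and shapes are illustrative only.
import random
import torch
import torch.nn as nn

class CommerceMMSketch(nn.Module):
    """Toy stand-in for the multimodal encoder (hypothetical)."""
    def __init__(self, dim=256):
        super().__init__()
        self.image_proj = nn.Linear(2048, dim)   # assumed image feature size
        self.text_proj = nn.Linear(768, dim)     # assumed text feature size
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, image_feats, text_feats, drop_modality=None):
        # Modality randomization (assumed form): optionally drop one modality
        # for this step, so the model also learns from single-modality inputs.
        tokens = []
        if drop_modality != "image":
            tokens.append(self.image_proj(image_feats))
        if drop_modality != "text":
            tokens.append(self.text_proj(text_feats))
        return self.fusion(torch.cat(tokens, dim=1))

def image_text_losses(emb):
    # Placeholder standing in for the sum of the 5 image-text objectives.
    return emb.pow(2).mean()

def omni_retrieval_losses(emb):
    # Placeholder standing in for the sum of the 9 Omni-Retrieval objectives.
    return emb.abs().mean()

model = CommerceMMSketch()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
image_feats = torch.randn(8, 4, 2048)   # dummy batch: 8 pairs, 4 image regions each
text_feats = torch.randn(8, 16, 768)    # dummy batch: 16 text tokens each

for group_loss_fn in (image_text_losses, omni_retrieval_losses):
    # One forward/backward update per task group -> two updates per batch,
    # instead of 14 separate passes.
    drop = random.choice([None, "image", "text"])
    emb = model(image_feats, text_feats, drop_modality=drop)
    loss = group_loss_fn(emb)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The intended point of the grouping is efficiency: rather than 14 separate passes, one pass serves the image-text objectives and a second serves the Omni-Retrieval objectives, with the losses summed inside each group.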
Related papers
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for large-scale multimodal pre-training.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance-based and sentiment-based contrastive learning, to aid prediction and learn more sentiment-related interaction information.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
- Knowledge Perceived Multi-modal Pretraining in E-commerce [12.012793707741562]
Current multi-modal pretraining methods for image and text modalities lack robustness when a modality is missing or noisy.
We propose K3M, which introduces a knowledge modality into multi-modal pretraining to correct noise in, and supplement missing information from, the image and text modalities.
arXiv Detail & Related papers (2021-08-20T08:01:28Z)
- InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6.
The model has a strong capability for modeling interaction between the information flows of different modalities.
We also propose a large-scale dataset for multi-modal pretraining in Chinese and develop the Chinese InterBERT, the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)