Learning Instance-Level Representation for Large-Scale Multi-Modal
Pretraining in E-commerce
- URL: http://arxiv.org/abs/2304.02853v1
- Date: Thu, 6 Apr 2023 04:14:41 GMT
- Title: Learning Instance-Level Representation for Large-Scale Multi-Modal
Pretraining in E-commerce
- Authors: Yang Jin, Yongzhi Li, Zehuan Yuan, Yadong Mu
- Abstract summary: We propose an instance-centric multi-modal pretraining paradigm called ECLIP in this work.
To enable the model to focus on the desired product instance without reliance on expensive manual annotations, two specially configured pretext tasks are proposed.
ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating its strong transferability to real-world E-commerce applications.
- Score: 35.73830796500975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper aims to establish a generic multi-modal foundation model that
can scale to massive numbers of downstream applications in E-commerce.
Recently, large-scale vision-language pretraining approaches have achieved
remarkable advances in the general domain. However, due to the significant
differences between natural and product images, directly applying these
image-level representation frameworks to E-commerce is inevitably
sub-optimal. To this end, we propose an instance-centric multi-modal
pretraining paradigm called ECLIP in this work. In detail, we craft a decoder
architecture that introduces a set of learnable instance queries to explicitly
aggregate instance-level semantics. Moreover, to enable the model to focus on
the desired product instance without reliance on expensive manual annotations,
two specially configured pretext tasks are further proposed. Pretrained on
100 million E-commerce-related samples, ECLIP successfully extracts more generic,
semantic-rich, and robust representations. Extensive experimental results show
that, without further fine-tuning, ECLIP surpasses existing methods by a large
margin on a broad range of downstream tasks, demonstrating its strong
transferability to real-world E-commerce applications.
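To make the instance-query idea concrete, the following is a minimal sketch of how a small set of learnable queries could cross-attend to image patch features and pool instance-level embeddings. The module name, dimensions, and single-layer design are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the ECLIP code): learnable instance queries that
# cross-attend to image patch features and pool instance-level embeddings.
import torch
import torch.nn as nn


class InstanceQueryDecoder(nn.Module):
    def __init__(self, num_queries: int = 8, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # A small, fixed set of learnable instance queries (hypothetical size).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) from any image encoder.
        batch = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Each query attends over all patches and aggregates one instance-level vector.
        attended, _ = self.cross_attn(q, patch_tokens, patch_tokens)
        x = self.norm(q + attended)
        return x + self.ffn(x)  # (batch, num_queries, dim) instance embeddings


if __name__ == "__main__":
    decoder = InstanceQueryDecoder()
    patches = torch.randn(2, 196, 512)  # e.g. a 14x14 ViT patch grid
    print(decoder(patches).shape)       # torch.Size([2, 8, 512])
```

The queries play a role analogous to object queries in detection-style decoders: each one learns to attend to a coherent region so that product-instance semantics can be pooled without bounding-box annotations.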
Related papers
- eCeLLM: Generalizing Large Language Models for E-commerce from Large-scale, High-quality Instruction Data [12.895762133464103]
We construct ECInstruct, the first open-sourced, large-scale, and high-quality benchmark instruction dataset for e-commerce.
We develop eCeLLM, a series of e-commerce LLMs, by instruction-tuning general-purpose LLMs.
eCeLLM exhibits excellent generalizability to out-of-domain settings, including unseen products and unseen instructions.
arXiv Detail & Related papers (2024-02-13T22:26:24Z)
- EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data [67.8302955948861]
Large Language Models (LLMs) pre-trained on massive corpora have exhibited remarkable performance on various NLP tasks.
Applying these models to specific domains still poses significant challenges, such as lack of domain knowledge.
We focus on domain-specific continual pre-training of LLMs using E-commerce domain as an exemplar.
arXiv Detail & Related papers (2023-12-25T11:31:47Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both textual and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- EcomGPT: Instruction-tuning Large Language Models with Chain-of-Task Tasks for E-commerce [68.72104414369635]
We propose the first e-commerce instruction dataset EcomInstruct, with a total of 2.5 million instruction data.
EcomGPT outperforms ChatGPT in terms of cross-dataset/task generalization on E-commerce tasks.
arXiv Detail & Related papers (2023-08-14T06:49:53Z)
- Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization [14.223683006262151]
We propose a novel approach to boost multi-modal e-commerce attribute value extraction via unified learning scheme and dynamic range minimization.
Experiments on the popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over the other state-of-the-art techniques.
arXiv Detail & Related papers (2022-07-15T03:58:04Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We then train a more effective cross-modal model that can adaptively incorporate key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- Knowledge Perceived Multi-modal Pretraining in E-commerce [12.012793707741562]
Current multi-modal pretraining methods for image and text modalities lack robustness when a modality is missing or noisy.
We propose K3M, which introduces a knowledge modality into multi-modal pretraining to correct noise in, and supplement missing information from, the image and text modalities.
arXiv Detail & Related papers (2021-08-20T08:01:28Z)
- Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE).
arXiv Detail & Related papers (2021-07-30T12:11:24Z)