Cross-view Semantic Alignment for Livestreaming Product Recognition
- URL: http://arxiv.org/abs/2308.04912v2
- Date: Sat, 19 Aug 2023 02:00:16 GMT
- Title: Cross-view Semantic Alignment for Livestreaming Product Recognition
- Authors: Wenjie Yang, Yiyi Chen, Yan Li, Yanhua Cheng, Xudong Liu, Quan Chen,
Han Li
- Abstract summary: We present LPR4M, a large-scale multimodal dataset that covers 34 categories.
LPR4M contains diverse videos and noisy modality pairs while exhibiting a long-tailed distribution.
A novel Patch Feature Reconstruction loss is proposed to penalize the semantic misalignment between cross-view patches.
- Score: 24.38606354376169
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Live commerce is the act of selling products online through live streaming.
The customer's diverse demands for online products introduce more challenges to
Livestreaming Product Recognition. Previous works have primarily focused on
fashion clothing data or utilized single-modal input, which does not reflect the
real-world scenario where multimodal data from various categories are present.
In this paper, we present LPR4M, a large-scale multimodal dataset that covers
34 categories, comprises 3 modalities (image, video, and text), and is 50x
larger than the largest publicly available dataset. LPR4M contains diverse
videos and noisy modality pairs while exhibiting a long-tailed distribution,
resembling real-world problems. Moreover, a cRoss-vIew semantiC alignmEnt
(RICE) model is proposed to learn discriminative instance features from the
image and video views of the products. This is achieved through instance-level
contrastive learning and cross-view patch-level feature propagation. A novel
Patch Feature Reconstruction loss is proposed to penalize the semantic
misalignment between cross-view patches. Extensive experiments demonstrate the
effectiveness of RICE and provide insights into the importance of dataset
diversity and expressivity. The dataset and code are available at
https://github.com/adxcreative/RICE
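To make the two training signals concrete, here is a minimal PyTorch-style sketch of instance-level contrastive learning and a patch feature reconstruction loss between image and video patches. The shapes, the attention-based reconstruction, and the MSE penalty are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb, vid_emb, temperature=0.07):
    """Instance-level contrastive loss over B matched image-video pairs.

    img_emb, vid_emb: (B, D) L2-normalized instance features; the i-th
    image and i-th video form a positive pair, all others are negatives.
    """
    logits = img_emb @ vid_emb.t() / temperature              # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def patch_feature_reconstruction(img_patches, vid_patches):
    """Hypothetical patch feature reconstruction (PFR) loss.

    Each image patch is reconstructed as an attention-weighted mixture of
    video patches; a low reconstruction error means cross-view patches are
    semantically aligned. img_patches: (B, Ni, D), vid_patches: (B, Nv, D).
    """
    scale = img_patches.size(-1) ** 0.5
    attn = torch.softmax(
        img_patches @ vid_patches.transpose(1, 2) / scale, dim=-1)  # (B, Ni, Nv)
    recon = attn @ vid_patches                                      # (B, Ni, D)
    return F.mse_loss(recon, img_patches)
```

Minimizing the reconstruction error forces each image patch to be expressible from semantically matching video patches, which is the cross-view alignment intuition named in the abstract.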
Related papers
- ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval [28.13183873658186]
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions.
Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate.
We propose ASR-enhanced Multimodal Product Representation Learning (AMPere).
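The summary only names the method; as a rough illustration of the general idea (enriching visual features with ASR-derived text), here is a hedged sketch with assumed dimensions and a hypothetical fusion module, not AMPere's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASREnhancedFusion(nn.Module):
    """Hypothetical late fusion of pooled visual features with an
    embedding of the ASR transcript of a video or live stream."""
    def __init__(self, vis_dim=768, txt_dim=768, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(vis_dim + txt_dim, out_dim)

    def forward(self, vis_feat, asr_feat):
        # vis_feat: (B, vis_dim), asr_feat: (B, txt_dim)
        fused = torch.cat([vis_feat, asr_feat], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)   # (B, out_dim)
```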
arXiv Detail & Related papers (2024-08-06T06:24:10Z)
- Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval [32.478352606125306]
We propose a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus on the intended products.
A long-range spatiotemporal graph network is further designed to achieve both instance-level interaction and frame-level matching.
We demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin.
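A minimal sketch of what a text-guided attention mechanism of this kind could look like, with the spoken-content embedding serving as the query over visual tokens; module names and dimensions are assumptions rather than SGMN's implementation:

```python
import torch
import torch.nn as nn

class TextGuidedAttention(nn.Module):
    """Hypothetical text-guided attention: the salesperson's spoken
    content (encoded as a text vector) attends over frame/region
    features so the model focuses on the product being talked about."""
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # query from the text embedding
        self.k = nn.Linear(dim, dim)   # keys from visual tokens
        self.v = nn.Linear(dim, dim)   # values from visual tokens

    def forward(self, text_emb, visual_tokens):
        # text_emb: (B, dim); visual_tokens: (B, N, dim)
        q = self.q(text_emb).unsqueeze(1)                  # (B, 1, dim)
        k, v = self.k(visual_tokens), self.v(visual_tokens)
        attn = torch.softmax(
            q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # (B, 1, N)
        return (attn @ v).squeeze(1)                       # (B, dim)
```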
arXiv Detail & Related papers (2024-07-23T07:36:54Z)
- Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding [25.4933695784155]
Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across different ages, experiences, and genders.
To bridge the gap to real-world applications, we introduce a large-scale Subjective Response Indicators for Advertisement Videos dataset.
We developed tasks and protocols to analyze and evaluate the extent of cognitive understanding of video content among different users.
arXiv Detail & Related papers (2024-07-11T03:00:26Z)
- Cross-Domain Product Representation Learning for Rich-Content E-Commerce [16.418118040661646]
This paper introduces a large-scale cRoss-dOmain Product rEcognition dataset, called ROPE.
ROPE covers a wide range of product categories and contains over 180,000 products, corresponding to millions of short videos and live streams.
It is the first dataset to cover product pages, short videos, and live streams simultaneously, providing the basis for establishing a unified product representation across different media domains.
arXiv Detail & Related papers (2023-08-10T13:06:05Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
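A minimal sketch of a lightweight cross-modal MLP in this spirit, projecting a referring-expression text embedding into a prompt-embedding space so language can drive the mask decoder; the dimensions and layer layout are assumptions, not RefSAM's actual configuration:

```python
import torch.nn as nn

class CrossModalMLP(nn.Module):
    """Hypothetical lightweight MLP projecting a text embedding into a
    SAM-style prompt-embedding space. Dimensions are assumptions."""
    def __init__(self, text_dim=768, prompt_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, prompt_dim),
        )

    def forward(self, text_emb):
        # text_emb: (B, text_dim) -> (B, prompt_dim) prompt embedding
        return self.net(text_emb)
```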
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Factorized Contrastive Learning: Going Beyond Multi-view Redundancy [116.25342513407173]
This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy.
On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-06-08T15:17:04Z)
- BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency [66.8685113725007]
BiCro aims to estimate soft labels for noisy data pairs to reflect their true correspondence degree.
Experiments on three popular cross-modal matching datasets demonstrate that BiCro significantly improves the noise-robustness of various matching models.
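As a rough illustration of soft-label estimation for noisy pairs, here is a simplified batch-level proxy; BiCro itself measures similarity consistency against selected clean anchor pairs, so treat this as an assumption-laden sketch:

```python
import torch
import torch.nn.functional as F

def bicro_soft_labels(img_emb, txt_emb, temperature=0.05):
    """Simplified soft-label estimate for possibly-noisy pairs (a sketch).

    Intuition (after BiCro): a pair deserves a high soft label only if
    the image-to-text and text-to-image views agree that the pair stands
    out against the rest of the batch. The batch-level proxy below is an
    assumption, not the paper's anchor-based estimator.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                          # (B, B)
    p_i2t = torch.softmax(sim / temperature, dim=1).diagonal()
    p_t2i = torch.softmax(sim / temperature, dim=0).diagonal()
    return 2 * p_i2t * p_t2i / (p_i2t + p_t2i + 1e-8)    # harmonic mean in [0, 1]
```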
arXiv Detail & Related papers (2023-03-22T09:33:50Z)
- Multi-queue Momentum Contrast for Microvideo-Product Retrieval [57.527227171945796]
We formulate the microvideo-product retrieval task, the first attempt to explore retrieval in which both the query and the target are multi-modal instances.
A novel approach named Multi-Queue Momentum Contrast (MQMC) network is proposed for bidirectional retrieval.
A discriminative selection strategy with a multi-queue is used to distinguish the importance of different negatives based on their categories.
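A simplified sketch of a momentum-contrast loss with per-category negative queues, illustrating the multi-queue idea; the weighting scheme and shapes are assumptions, not MQMC's implementation:

```python
import torch
import torch.nn.functional as F

def mqmc_loss(query, pos_key, queues, weights, temperature=0.07):
    """Simplified multi-queue momentum-contrast loss (a sketch).

    query:   (D,) L2-normalized micro-video feature.
    pos_key: (D,) matching product feature from a momentum encoder.
    queues:  list of (K_c, D) negative banks, one per product category,
             filled with momentum-encoder outputs (MoCo-style).
    weights: per-category scalars scaling how strongly that category's
             negatives count -- a crude stand-in for the paper's
             discriminative selection strategy.
    """
    l_pos = (query @ pos_key).unsqueeze(0)                           # (1,)
    l_neg = torch.cat([w * (q @ query) for q, w in zip(queues, weights)])
    logits = torch.cat([l_pos, l_neg]) / temperature                 # (1 + N,)
    target = torch.zeros(1, dtype=torch.long)                        # positive at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```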
arXiv Detail & Related papers (2022-12-22T03:47:14Z)
- Multi-Modal Attribute Extraction for E-Commerce [4.626261940793027]
We develop a novel approach to seamlessly combine modalities, which is inspired by our single-modality investigations.
Experiments on Rakuten-Ichiba data provide empirical evidence for the benefits of our approach.
arXiv Detail & Related papers (2022-03-07T14:48:44Z)
- Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE).
arXiv Detail & Related papers (2021-07-30T12:11:24Z)
- Comprehensive Information Integration Modeling Framework for Video Titling [124.11296128308396]
We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
The proposed method consists of two processes: granular-level interaction modeling and abstraction-level story-line summarization.
Accordingly, we collect a large-scale dataset from real-world data on Taobao, a world-leading e-commerce platform.
arXiv Detail & Related papers (2020-06-24T10:38:15Z)