Multi-queue Momentum Contrast for Microvideo-Product Retrieval
- URL: http://arxiv.org/abs/2212.11471v1
- Date: Thu, 22 Dec 2022 03:47:14 GMT
- Title: Multi-queue Momentum Contrast for Microvideo-Product Retrieval
- Authors: Yali Du, Yinwei Wei, Wei Ji, Fan Liu, Xin Luo and Liqiang Nie
- Abstract summary: We formulate the microvideo-product retrieval task, which is the first attempt to explore retrieval between multi-modal query and multi-modal target instances.
A novel approach named Multi-Queue Momentum Contrast (MQMC) network is proposed for bidirectional retrieval.
A discriminative selection strategy with a multi-queue is used to distinguish the importance of different negatives based on their categories.
- Score: 57.527227171945796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The booming development and huge market of micro-videos bring new e-commerce
channels for merchants. Currently, more and more micro-video publishers prefer
to embed relevant ads into their micro-videos, which not only provides them
with business income but also helps audiences discover products of interest.
However, because micro-videos are recorded with unprofessional equipment, cover
various topics, and comprise multiple modalities, it is challenging to locate
the products related to them efficiently, appropriately, and accurately. We
formulate the microvideo-product retrieval task, which is the first attempt to
explore retrieval between multi-modal query and multi-modal target instances.
A novel approach, the Multi-Queue Momentum Contrast (MQMC) network, is proposed
for bidirectional retrieval; it consists of uni-modal feature learning and
multi-modal instance representation learning. Moreover, a discriminative
selection strategy with a multi-queue is used to distinguish the importance of
different negatives based on their categories. We collect two large-scale
microvideo-product datasets (MVS and MVS-large) for evaluation and manually
construct a hierarchical category ontology that covers sundry products in
daily life. Extensive experiments show that MQMC outperforms the
state-of-the-art baselines. Our replication package (including code, dataset,
etc.) is publicly available at https://github.com/duyali2000/MQMC.
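For readers unfamiliar with momentum contrast, the sketch below illustrates one plausible reading of the multi-queue design described in the abstract: a MoCo-style objective with a momentum-updated key encoder, one negative queue per product category, and heavier weights on same-category negatives. The class name, linear encoders, queue size, and up-weighting scheme are illustrative assumptions, not the released MQMC implementation; see the repository linked above for the actual code.

```python
# Minimal sketch of a multi-queue momentum-contrast objective, written against
# pre-extracted micro-video (query) and product (key) embeddings. The linear
# encoders, queue size, and same-category up-weighting are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F


class MultiQueueMoCo(torch.nn.Module):
    def __init__(self, dim=256, queue_size=1024, num_categories=10,
                 momentum=0.999, temperature=0.07, same_cat_weight=2.0):
        super().__init__()
        self.m, self.t, self.w_same = momentum, temperature, same_cat_weight
        # Query encoder (micro-video side) and momentum-updated key encoder
        # (product side); real encoders would fuse visual/text/audio features.
        self.encoder_q = torch.nn.Linear(dim, dim)
        self.encoder_k = torch.nn.Linear(dim, dim)
        self.encoder_k.load_state_dict(self.encoder_q.state_dict())
        for p in self.encoder_k.parameters():
            p.requires_grad = False
        # One negative queue per product category (the "multi-queue").
        self.register_buffer(
            "queues",
            F.normalize(torch.randn(num_categories, queue_size, dim), dim=-1))
        self.register_buffer("ptr", torch.zeros(num_categories, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # Exponential moving average of the query encoder, as in MoCo.
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.mul_(self.m).add_(pq, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys, categories):
        # Push each product key into the queue of its own category.
        for k, c in zip(keys, categories.tolist()):
            i = int(self.ptr[c])
            self.queues[c, i] = k
            self.ptr[c] = (i + 1) % self.queues.size(1)

    def forward(self, video_feat, product_feat, categories):
        """video_feat, product_feat: (B, dim); categories: (B,) category ids."""
        q = F.normalize(self.encoder_q(video_feat), dim=-1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(product_feat), dim=-1)
        l_pos = (q * k).sum(-1, keepdim=True) / self.t                # (B, 1)
        l_neg = torch.einsum("bd,cnd->bcn", q, self.queues) / self.t  # (B, C, N)
        # Discriminative weighting: same-category negatives are treated as
        # harder and up-weighted (one possible reading of the paper's strategy).
        w = torch.ones(q.size(0), self.queues.size(0), device=q.device)
        w[torch.arange(q.size(0)), categories] = self.w_same
        l_neg = l_neg + torch.log(w).unsqueeze(-1)
        logits = torch.cat([l_pos, l_neg.flatten(1)], dim=1)
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)
        self._enqueue(k, categories)
        return loss
```

For bidirectional retrieval, the same loss could be applied in the product-to-video direction with the roles of the two encoders swapped; summing the two directions symmetrically is one common design choice, though the paper should be consulted for the exact formulation.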
Related papers
- Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval [32.478352606125306]
We propose a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus on the intended products.
A long-range temporal graph network is further designed to achieve both instance-level interaction and frame-level matching.
We demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin.
arXiv Detail & Related papers (2024-07-23T07:36:54Z)
- Cross-view Semantic Alignment for Livestreaming Product Recognition [24.38606354376169]
We present LPR4M, a large-scale multimodal dataset that covers 34 categories.
LPR4M contains diverse videos and noisy modality pairs while exhibiting a long-tailed distribution.
A novel Patch Feature Reconstruction loss is proposed to penalize the semantic misalignment between cross-view patches.
arXiv Detail & Related papers (2023-08-09T12:23:41Z)
- Multi-video Moment Ranking with Multimodal Clue [69.81533127815884]
State-of-the-art work for video corpus moment retrieval (VCMR) is based on a two-stage method.
MINUTE outperforms the baselines on the TVR and DiDeMo datasets.
arXiv Detail & Related papers (2023-01-29T18:38:13Z)
- CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval [24.649068267308913]
Video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
We propose a novel model for effective moment localization and ranking.
We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
arXiv Detail & Related papers (2021-09-21T08:07:27Z)
- Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE).
arXiv Detail & Related papers (2021-07-30T12:11:24Z)
- DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
- Fashion Focus: Multi-modal Retrieval System for Video Commodity Localization in E-commerce [18.651201334846352]
We present an innovative demonstration of a multi-modal retrieval system called "Fashion Focus".
It exactly localizes the product images in online videos as the focuses.
Our system employs two procedures for analysis, namely video content structuring and multi-modal retrieval, to automatically achieve accurate video-to-shop matching.
arXiv Detail & Related papers (2021-02-09T09:45:04Z)
- VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO).
The main challenge in this task is to jointly model the temporal dependency of the video with the semantic meaning of the article.
We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z)
- Predicting the Popularity of Micro-videos with Multimodal Variational Encoder-Decoder Framework [54.194340961353944]
We propose a multimodal variational encoder-decoder (MMVED) framework for the micro-video popularity prediction task.
MMVED learns a prediction embedding of a micro-video that is informative about its popularity level.
Experiments conducted on a public dataset and a dataset we collect from Xigua demonstrate the effectiveness of the proposed MMVED framework.
arXiv Detail & Related papers (2020-03-28T06:08:16Z)