Fashion Focus: Multi-modal Retrieval System for Video Commodity
Localization in E-commerce
- URL: http://arxiv.org/abs/2102.04727v1
- Date: Tue, 9 Feb 2021 09:45:04 GMT
- Title: Fashion Focus: Multi-modal Retrieval System for Video Commodity
Localization in E-commerce
- Authors: Yanhao Zhang, Qiang Wang, Pan Pan, Yun Zheng, Cheng Da, Siyang Sun and
Yinghui Xu
- Abstract summary: We present an innovative demonstration of a multi-modal retrieval system called "Fashion Focus".
It exactly localizes the product images in the online video as the focuses.
Our system employs two procedures for analysis, including video content structuring and multi-modal retrieval, to automatically achieve accurate video-to-shop matching.
- Score: 18.651201334846352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, live-stream and short video shopping in E-commerce have grown
exponentially. However, sellers are required to manually match images of
the products on sale to the timestamps at which they are exhibited in the
untrimmed video, resulting in a complicated process. To solve this problem,
we present an innovative demonstration of a multi-modal retrieval system
called "Fashion Focus", which exactly localizes the product images in the
online video as the focuses. Different modalities contribute to the commodity
localization: visual content, linguistic features and interaction context are
jointly investigated via the presented multi-modal learning. Our system
employs two procedures for analysis, including video content structuring and
multi-modal retrieval, to automatically achieve accurate video-to-shop
matching. Fashion Focus presents a unified framework that can orient
consumers towards relevant product exhibitions while they watch videos and
help sellers effectively deliver products over search and recommendation.
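As a rough illustration of this two-stage flow, the sketch below assumes that video content structuring has already produced per-segment visual and text embeddings (from hypothetical frame and ASR encoders), and then fuses cosine similarities to localize each product image to a time span. The function names, the fusion weight alpha and the toy data are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of video-to-shop localization under the assumptions stated
# above: per-segment multi-modal embeddings are matched against product-image
# embeddings, and the best-scoring segment becomes the product's "focus".
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def localize_products(segment_visual, segment_text, product_image_embs, alpha=0.7):
    """segment_visual / segment_text: lists of (start, end, embedding) tuples
    produced during video content structuring by hypothetical visual and ASR
    text encoders. product_image_embs: {product_id: embedding}. Returns the
    best-matching time span for each product."""
    focuses = {}
    for pid, p_emb in product_image_embs.items():
        scores = [
            alpha * cosine(v_emb, p_emb) + (1.0 - alpha) * cosine(t_emb, p_emb)
            for (_, _, v_emb), (_, _, t_emb) in zip(segment_visual, segment_text)
        ]
        start, end, _ = segment_visual[int(np.argmax(scores))]
        focuses[pid] = (start, end)
    return focuses

# Toy usage: six 5-second segments with random 128-d embeddings standing in
# for real encoder outputs.
rng = np.random.default_rng(0)
vis = [(5.0 * i, 5.0 * (i + 1), rng.normal(size=128)) for i in range(6)]
txt = [(s, e, rng.normal(size=128)) for (s, e, _) in vis]
products = {"dress_001": rng.normal(size=128), "bag_017": rng.normal(size=128)}
print(localize_products(vis, txt, products))
```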
Related papers
- ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising [2.330164376631038]
Contextual advertising serves ads that are aligned to the content that the user is viewing.
Current text-to-video retrieval models based on joint multimodal training demand large datasets and computational resources.
We introduce ContextIQ, a multimodal expert-based video retrieval system designed specifically for contextual advertising.
arXiv Detail & Related papers (2024-10-29T17:01:05Z) - Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval [32.478352606125306]
We propose a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus on the intended products.
A long-range spatiotemporal graph network is further designed to achieve both instance-level interaction and frame-level matching.
We demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin.
arXiv Detail & Related papers (2024-07-23T07:36:54Z) - Cross-Domain Product Representation Learning for Rich-Content E-Commerce [16.418118040661646]
- Cross-Domain Product Representation Learning for Rich-Content E-Commerce [16.418118040661646]
This paper introduces a large-scale cRoss-dOmain Product rEcognition dataset, called ROPE.
ROPE covers a wide range of product categories and contains over 180,000 products, corresponding to millions of short videos and live streams.
It is the first dataset to cover product pages, short videos, and live streams simultaneously, providing the basis for establishing a unified product representation across different media domains.
arXiv Detail & Related papers (2023-08-10T13:06:05Z) - Multi-queue Momentum Contrast for Microvideo-Product Retrieval [57.527227171945796]
We formulate the microvideo-product retrieval task, which is the first attempt to explore retrieval between two types of multi-modal instances (micro-videos and products).
A novel approach named Multi-Queue Momentum Contrast (MQMC) network is proposed for bidirectional retrieval.
A discriminative selection strategy with a multi-queue is used to distinguish the importance of different negatives based on their categories.
arXiv Detail & Related papers (2022-12-22T03:47:14Z) - Multi-modal Representation Learning for Video Advertisement Content
Structuring [10.45050088240847]
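The multi-queue idea above can be pictured with the toy sketch below: one negative queue is kept per product category, and a contrastive loss weights same-category negatives differently from negatives drawn from other categories' queues. The queue size, weights and temperature are invented for illustration; this is not the released MQMC implementation.

```python
# Per-category negative queues plus an InfoNCE-style loss that weights
# negatives by category, as one simple reading of the "multi-queue" strategy.
import torch
import torch.nn.functional as F
from collections import defaultdict, deque

class MultiQueue:
    def __init__(self, queue_size: int = 1024):
        self.queues = defaultdict(lambda: deque(maxlen=queue_size))

    def enqueue(self, category: str, embs: torch.Tensor):
        for e in embs.detach():
            self.queues[category].append(e)

    def negatives(self, category: str):
        """Return (same-category negatives, other-category negatives)."""
        same = list(self.queues[category])
        other = [e for c, q in self.queues.items() if c != category for e in q]
        stack = lambda xs: torch.stack(xs) if xs else torch.empty(0, 0)
        return stack(same), stack(other)

def multi_queue_nce(query, positive, same_neg, other_neg,
                    t=0.07, w_same=1.0, w_other=0.5):
    """Contrastive loss where same-category negatives count more than
    negatives from other category queues (weights are arbitrary)."""
    pos = (query * positive).sum(-1, keepdim=True) / t          # (B, 1)
    logits = [pos]
    if same_neg.numel():
        logits.append(w_same * (query @ same_neg.T) / t)
    if other_neg.numel():
        logits.append(w_other * (query @ other_neg.T) / t)
    logits = torch.cat(logits, dim=1)
    return F.cross_entropy(logits, torch.zeros(query.shape[0], dtype=torch.long))

# Toy usage with random embeddings and two categories.
mq = MultiQueue()
mq.enqueue("dress", torch.randn(32, 128))
mq.enqueue("bag", torch.randn(32, 128))
same, other = mq.negatives("dress")
print(multi_queue_nce(torch.randn(4, 128), torch.randn(4, 128), same, other).item())
```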
- Multi-modal Representation Learning for Video Advertisement Content Structuring [10.45050088240847]
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions.
Video advertisements contain sufficient and useful multi-modal content such as captions and speech.
We propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting between video-audio and text.
arXiv Detail & Related papers (2021-09-04T09:08:29Z) - A Multimodal Framework for Video Ads Understanding [64.70769354696019]
We develop a multimodal system to improve structured analysis of advertising video content.
Our solution achieved a score of 0.2470 under a metric that accounts for both localization and prediction accuracy, ranking fourth on the 2021 TAAC final leaderboard.
arXiv Detail & Related papers (2021-08-29T16:06:00Z) - VMSMO: Learning to Generate Multimodal Summary for Video-based News
Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO).
The main challenge in this task is to jointly model the temporal dependency of the video with the semantic meaning of the article.
We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z) - Poet: Product-oriented Video Captioner for E-commerce [124.9936946822493]
In e-commerce, a growing number of user-generated videos are used for product promotion. Generating video descriptions that narrate the user-preferred product characteristics depicted in the video is vital for successful promotion.
We propose a product-oriented video captioner framework, abbreviated as Poet.
We show that Poet achieves consistent performance improvement over previous methods concerning generation quality, product aspects capturing, and lexical diversity.
arXiv Detail & Related papers (2020-08-16T10:53:46Z) - Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z) - Comprehensive Information Integration Modeling Framework for Video
Titling [124.11296128308396]
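A compact sketch of the kind of self-supervised objective such a network might use is shown below: visual, audio and text embeddings of the same clip are aligned with pairwise InfoNCE losses, using the other clips in the batch as negatives. This is an assumption-based illustration, not the MMV architecture from the paper.

```python
# Pairwise contrastive alignment of visual, audio and text embeddings from the
# same clips; embeddings are assumed to come from per-modality encoders.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, t: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of aligned embeddings (B, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / t
    labels = torch.arange(a.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

def multimodal_loss(visual, audio, text):
    # Visual is treated as the anchor modality; equal pair weights are arbitrary.
    return info_nce(visual, audio) + info_nce(visual, text)

# Toy batch of 8 clips with 256-d embeddings per modality.
v, a, x = (torch.randn(8, 256) for _ in range(3))
print(multimodal_loss(v, a, x).item())
```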
- Comprehensive Information Integration Modeling Framework for Video Titling [124.11296128308396]
We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
The proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
arXiv Detail & Related papers (2020-06-24T10:38:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.