Virtual ID Discovery from E-commerce Media at Alibaba: Exploiting
Richness of User Click Behavior for Visual Search Relevance
- URL: http://arxiv.org/abs/2102.04667v1
- Date: Tue, 9 Feb 2021 06:31:20 GMT
- Title: Virtual ID Discovery from E-commerce Media at Alibaba: Exploiting
Richness of User Click Behavior for Visual Search Relevance
- Authors: Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Jianmin Wu, Yinghui Xu,
Rong Jin
- Abstract summary: We propose to discover Virtual IDs from user click behavior to improve visual search relevance at Alibaba.
As a purely click-data-driven approach, we collect various types of click data to train deep networks without any human annotations.
Our networks encode richer supervision more effectively and better distinguish real-shot images in terms of category and feature.
- Score: 40.98749837102654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual search plays an essential role in E-commerce. To meet the search
demands of users and improve the shopping experience at Alibaba, the visual search
relevance of real-shot images is becoming the bottleneck. The traditional visual
search paradigm is usually based on supervised learning with labeled data. However,
it requires large-scale categorical labels with expensive human annotation, which
limits its applicability and still often fails to distinguish real-shot images. In
this paper, we propose to discover Virtual IDs from user click behavior to improve
visual search relevance at Alibaba. As a purely click-data-driven approach, we
collect various types of click data to train deep networks without any human
annotation. In particular, Virtual IDs are learned as classification supervision
with a co-click embedding, which exploits image relationships derived from user
co-click behavior to guide category prediction and feature learning. Concretely, we
deploy the Virtual ID Category Network, integrating first-clicks and switch-clicks
as a regularizer. Incorporating triplet and list-wise constraints, the Virtual ID
Feature Network is trained in a joint classification and ranking manner. Benefiting
from this exploration of user click data, our networks encode richer supervision and
better distinguish real-shot images in terms of both category and feature. To
validate our method for visual search relevance, we conduct an extensive set of
offline and online experiments on the collected real-shot images. We consistently
achieve better results across all components compared with alternative and
state-of-the-art methods.
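The abstract does not spell out how Virtual IDs are mined from co-click logs, so the sketch below only illustrates one plausible reading: images co-clicked within the same query session are linked in a weighted graph, and each connected component of the pruned graph is treated as one Virtual ID. All names here (build_coclick_graph, assign_virtual_ids, min_weight) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of Virtual ID discovery from co-click sessions; this is an
# illustrative reading of the abstract, not the authors' actual pipeline.
from collections import defaultdict
from itertools import combinations

import networkx as nx


def build_coclick_graph(sessions):
    """sessions: iterable of lists of image ids clicked within one query session."""
    weights = defaultdict(int)
    for clicked in sessions:
        for a, b in combinations(sorted(set(clicked)), 2):
            weights[(a, b)] += 1          # co-click count becomes the edge weight
    graph = nx.Graph()
    graph.add_weighted_edges_from((a, b, w) for (a, b), w in weights.items())
    return graph


def assign_virtual_ids(graph, min_weight=2):
    """Prune weak edges, then label each connected component as one Virtual ID."""
    pruned = nx.Graph()
    pruned.add_edges_from(
        (u, v, d) for u, v, d in graph.edges(data=True) if d["weight"] >= min_weight
    )
    return {
        image: vid
        for vid, component in enumerate(nx.connected_components(pruned))
        for image in component
    }
```

In this simplification, images without a surviving co-click edge receive no Virtual ID; the co-click embedding and the first-click/switch-click regularization described in the abstract are richer than this component-level grouping.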
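Likewise, the following is a minimal PyTorch sketch of the joint classification-and-ranking training described for the Virtual ID Feature Network, assuming Virtual IDs act as class labels for a cross-entropy term while co-clicked and non-co-clicked images form the triplets for a ranking term; the class name, loss weight alpha, and margin are illustrative assumptions rather than the authors' exact formulation.

```python
# Illustrative joint classification + triplet ranking objective; names and
# hyperparameters are assumptions, not the paper's reported configuration.
import torch.nn as nn
import torch.nn.functional as F


class VirtualIDFeatureNet(nn.Module):
    def __init__(self, backbone, feat_dim, num_virtual_ids):
        super().__init__()
        self.backbone = backbone                      # any CNN mapping images -> feat_dim vectors
        self.classifier = nn.Linear(feat_dim, num_virtual_ids)

    def forward(self, images):
        feats = F.normalize(self.backbone(images), dim=1)
        return feats, self.classifier(feats)


def joint_loss(feats, logits, virtual_ids, pos_feats, neg_feats, margin=0.3, alpha=1.0):
    """Cross-entropy over Virtual ID categories plus a triplet ranking penalty."""
    cls_loss = F.cross_entropy(logits, virtual_ids)
    rank_loss = F.triplet_margin_loss(feats, pos_feats, neg_feats, margin=margin)
    return cls_loss + alpha * rank_loss
```

The list-wise constraint mentioned in the abstract is omitted here for brevity.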
Related papers
- Grounded GUI Understanding for Vision Based Spatial Intelligent Agent: Exemplified by Virtual Reality Apps [41.601579396549404]
We propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter.
By imitating human behaviors, Orienter observes and understands the semantic contexts of VR app scenes first, before performing the detection.
arXiv Detail & Related papers (2024-09-17T00:58:00Z)
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Generalizable Person Search on Open-world User-Generated Video Content [93.72028298712118]
Person search is a challenging task that involves retrieving individuals from a large set of un-cropped scene images.
Existing person search applications are mostly trained and deployed in the same-origin scenarios.
We propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios.
arXiv Detail & Related papers (2023-10-16T04:59:50Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Cloth Interactive Transformer for Virtual Try-On [106.21605249649957]
We propose a novel two-stage cloth interactive transformer (CIT) method for the virtual try-on task.
In the first stage, we design a CIT matching block, aiming to precisely capture the long-range correlations between the cloth-agnostic person information and the in-shop cloth information.
In the second stage, we put forth a CIT reasoning block for establishing global mutual interactive dependencies among person representation, the warped clothing item, and the corresponding warped cloth mask.
arXiv Detail & Related papers (2021-04-12T14:45:32Z)
- Connecting Images through Time and Sources: Introducing Low-data, Heterogeneous Instance Retrieval [3.6526118822907594]
We show that it is not trivial to pick features responding well to a panel of variations and semantic content.
Introducing a new enhanced version of the Alegoria benchmark, we compare descriptors using the detailed annotations.
arXiv Detail & Related papers (2021-03-19T10:54:51Z)
- Visual Search at Alibaba [38.106392977338146]
We take advantage of large image collection of Alibaba and state-of-the-art deep learning techniques to perform visual search at scale.
A model- and search-based fusion approach is introduced to predict categories effectively.
We propose a deep CNN model for joint detection and feature learning by mining user click behavior.
arXiv Detail & Related papers (2021-02-09T06:46:50Z)
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z)