Virtual ID Discovery from E-commerce Media at Alibaba: Exploiting
Richness of User Click Behavior for Visual Search Relevance
- URL: http://arxiv.org/abs/2102.04667v1
- Date: Tue, 9 Feb 2021 06:31:20 GMT
- Title: Virtual ID Discovery from E-commerce Media at Alibaba: Exploiting
Richness of User Click Behavior for Visual Search Relevance
- Authors: Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Jianmin Wu, Yinghui Xu,
Rong Jin
- Abstract summary: We propose to discover Virtual ID from user click behavior to improve visual search relevance at Alibaba.
As a totally click-data driven approach, we collect various types of click data for training deep networks without any human annotations.
Our networks are more effective to encode richer supervision and better distinguish real-shot images in terms of category and feature.
- Score: 40.98749837102654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual search plays an essential role for E-commerce. To meet the search
demands of users and promote shopping experience at Alibaba, visual search
relevance of real-shot images is becoming the bottleneck. Traditional visual
search paradigm is usually based upon supervised learning with labeled data.
However, large-scale categorical labels are required with expensive human
annotations, which limits its applicability and also usually fails in
distinguishing the real-shot images. In this paper, we propose to discover
Virtual ID from user click behavior to improve visual search relevance at
Alibaba. As a totally click-data driven approach, we collect various types of
click data for training deep networks without any human annotations at all. In
particular, Virtual ID are learned as classification supervision with co-click
embedding, which explores image relationship from user co-click behaviors to
guide category prediction and feature learning. Concretely, we deploy Virtual
ID Category Network by integrating first-clicks and switch-clicks as
regularizer. Incorporating triplets and list constraints, Virtual ID Feature
Network is trained in a joint classification and ranking manner. Benefiting
from exploration of user click data, our networks are more effective to encode
richer supervision and better distinguish real-shot images in terms of category
and feature. To validate our method for visual search relevance, we conduct an
extensive set of offline and online experiments on the collected real-shot
images. We consistently achieve better experimental results across all
components, compared with alternative and state-of-the-art methods.
Related papers
- Generalizable Person Search on Open-world User-Generated Video Content [93.72028298712118]
Person search is a challenging task that involves retrieving individuals from a large set of un-cropped scene images.
Existing person search applications are mostly trained and deployed in the same-origin scenarios.
We propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios.
arXiv Detail & Related papers (2023-10-16T04:59:50Z) - Composed Image Retrieval using Contrastive Learning and Task-oriented
CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the Composed Image Retrieval goal is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z) - Beyond Semantics: Learning a Behavior Augmented Relevance Model with
Self-supervised Learning [25.356999988217325]
Relevance modeling aims to locate desirable items for corresponding queries.
auxiliary query-item interactions extracted from user historical behavior data could provide hints to reveal users' search intents further.
Our model builds multi-level co-attention for distilling coarse-grained and fine-grained semantic representations from both neighbor and target views.
arXiv Detail & Related papers (2023-08-10T06:52:53Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Leaning (VIL)
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Cloth Interactive Transformer for Virtual Try-On [106.21605249649957]
We propose a novel two-stage cloth interactive transformer (CIT) method for the virtual try-on task.
In the first stage, we design a CIT matching block, aiming to precisely capture the long-range correlations between the cloth-agnostic person information and the in-shop cloth information.
In the second stage, we put forth a CIT reasoning block for establishing global mutual interactive dependencies among person representation, the warped clothing item, and the corresponding warped cloth mask.
arXiv Detail & Related papers (2021-04-12T14:45:32Z) - Connecting Images through Time and Sources: Introducing Low-data,
Heterogeneous Instance Retrieval [3.6526118822907594]
We show that it is not trivial to pick features responding well to a panel of variations and semantic content.
Introducing a new enhanced version of the Alegoria benchmark, we compare descriptors using the detailed annotations.
arXiv Detail & Related papers (2021-03-19T10:54:51Z) - Visual Search at Alibaba [38.106392977338146]
We take advantage of large image collection of Alibaba and state-of-the-art deep learning techniques to perform visual search at scale.
Model and search-based fusion approach is introduced to effectively predict categories.
We propose a deep CNN model for joint detection and feature learning by mining user click behavior.
arXiv Detail & Related papers (2021-02-09T06:46:50Z) - ConsNet: Learning Consistency Graph for Zero-Shot Human-Object
Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.