Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product
Retrieval
- URL: http://arxiv.org/abs/2206.08842v1
- Date: Fri, 17 Jun 2022 15:40:45 GMT
- Title: Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product
Retrieval
- Authors: Xiao Dong, Xunlin Zhan, Yunchao Wei, Xiaoyong Wei, Yaowei Wang,
Minlong Lu, Xiaochun Cao, Xiaodan Liang
- Abstract summary: This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two realistic instance-level retrieval tasks.
We train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
- Score: 152.3504607706575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our goal in this research is to study a more realistic setting in which
we can conduct weakly-supervised multi-modal instance-level product retrieval
for fine-grained product categories. We first contribute the Product1M
dataset and define two realistic instance-level retrieval tasks to
enable evaluation for price comparison and personalized recommendation.
For both instance-level tasks, it is quite challenging to accurately pinpoint
the product targets mentioned in the visual-linguistic data and to effectively
suppress the influence of irrelevant content. To address this, we train a
more effective cross-modal pretraining model that adaptively incorporates
key concept information from the multi-modal data, using an entity graph
whose nodes and edges denote entities and the similarity relations between
entities, respectively. Specifically, a novel Entity-Graph Enhanced
Cross-Modal Pretraining (EGE-CMP) model is proposed for instance-level
commodity retrieval, which explicitly injects entity knowledge in both
node-based and subgraph-based ways into the multi-modal networks via a
self-supervised hybrid-stream transformer. This reduces the confusion between
different object contents and thereby guides the network to focus on entities
with real semantics. Experimental results verify the efficacy and
generalizability of our EGE-CMP, which outperforms several SOTA cross-modal
baselines such as CLIP, UNITER, and CAPTURE.
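To make the entity-graph mechanism concrete, below is a minimal sketch of the two injection routes the abstract describes: nodes are entity embeddings, edges are similarity relations, and entity knowledge enters the multi-modal token stream either per node or pooled over a subgraph. All function names, the cosine-similarity thresholding, and the one-step mean aggregation are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's code): build an entity similarity
# graph and inject entity knowledge into a multi-modal token sequence
# in node-based and subgraph-based ways.
import torch
import torch.nn.functional as F

def build_entity_graph(entity_emb: torch.Tensor, threshold: float = 0.5):
    """entity_emb: (num_entities, dim) embeddings of extracted entities.
    Returns an adjacency matrix whose edges encode cosine-similarity
    relations between entities (threshold is an assumed hyperparameter)."""
    normed = F.normalize(entity_emb, dim=-1)
    sim = normed @ normed.t()                 # pairwise cosine similarity
    adj = (sim > threshold).float()           # keep sufficiently similar pairs
    adj.fill_diagonal_(0.0)                   # no self-loops
    return adj, sim

def node_based_injection(tokens: torch.Tensor, entity_emb: torch.Tensor):
    """Node-based route: append each entity embedding as an extra token,
    so attention in the transformer can attend to individual entities.
    tokens: (seq_len, dim); entity_emb: (num_entities, dim)."""
    return torch.cat([tokens, entity_emb], dim=0)

def subgraph_based_injection(tokens: torch.Tensor,
                             entity_emb: torch.Tensor,
                             adj: torch.Tensor):
    """Subgraph-based route: mix each entity with its graph neighbours
    (one step of mean message passing), then append the pooled subgraph
    summary as a single token."""
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
    neighbour_mix = (adj @ entity_emb) / deg  # neighbourhood average
    subgraph_token = (entity_emb + neighbour_mix).mean(dim=0, keepdim=True)
    return torch.cat([tokens, subgraph_token], dim=0)

# Usage: 4 entities with 16-dim embeddings, an 8-token multi-modal sequence.
entities = torch.randn(4, 16)
tokens = torch.randn(8, 16)
adj, _ = build_entity_graph(entities, threshold=0.3)
fused = subgraph_based_injection(node_based_injection(tokens, entities),
                                 entities, adj)
print(fused.shape)  # (13, 16): 8 tokens + 4 entity nodes + 1 subgraph token
```

In the full EGE-CMP model these routes feed a self-supervised hybrid-stream transformer; the sketch only shows where entity knowledge could enter the token sequence.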
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- One for all: A novel Dual-space Co-training baseline for Large-scale Multi-View Clustering [42.92751228313385]
We propose a novel multi-view clustering model, named Dual-space Co-training Large-scale Multi-view Clustering (DSCMC)
The main objective of our approach is to enhance the clustering performance by leveraging co-training in two distinct spaces.
Our algorithm has approximately linear computational complexity, which makes it applicable to large-scale datasets.
arXiv Detail & Related papers (2024-01-28T16:30:13Z)
- Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
arXiv Detail & Related papers (2023-06-19T15:31:34Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [108.86502855439774]
We investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval.
We contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval.
We propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE); a generic sketch of this family of contrastive retrieval objectives appears after this list.
arXiv Detail & Related papers (2021-07-30T12:11:24Z)
- Graph Pattern Loss based Diversified Attention Network for Cross-Modal Retrieval [10.420129873840578]
Cross-modal retrieval aims to enable flexible retrieval experience by combining multimedia data such as image, video, text, and audio.
A core idea of unsupervised approaches is to mine correlations among different object representations to achieve satisfactory retrieval performance without requiring expensive labels.
We propose a Graph Pattern Loss based Diversified Attention Network (GPLDAN) for unsupervised cross-modal retrieval.
arXiv Detail & Related papers (2021-06-25T10:53:07Z)
- CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z)
- Mining Implicit Entity Preference from User-Item Interaction Data for Knowledge Graph Completion via Adversarial Learning [82.46332224556257]
We propose a novel adversarial learning approach by leveraging user interaction data for the Knowledge Graph Completion task.
Our generator is isolated from user interaction data, and serves to improve the performance of the discriminator.
To discover the implicit entity preferences of users, we design an elaborate collaborative learning algorithm based on graph neural networks.
arXiv Detail & Related papers (2020-03-28T05:47:33Z)
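Several of the works above and the EGE-CMP baselines (CLIP, CAPTURE, USER) share the same retrieval primitive: embed both modalities, score pairs by similarity, and train with a contrastive objective. Below is a generic, hypothetical sketch of that primitive, not any specific paper's implementation; the function name and temperature value are assumptions.

```python
# Generic cross-modal contrastive retrieval sketch (illustrative only).
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs.
    img_emb, txt_emb: (batch, dim); row i of each is a matched pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(img.size(0))        # matched pairs on the diagonal
    # Pull matched pairs together and push mismatches apart, both directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# At test time, instance-level retrieval ranks gallery items by the
# same normalized similarity used during training.
queries, gallery = torch.randn(4, 32), torch.randn(4, 32)
scores = F.normalize(queries, dim=-1) @ F.normalize(gallery, dim=-1).t()
ranking = scores.argsort(dim=-1, descending=True)  # per-query ranked indices
print(contrastive_retrieval_loss(queries, gallery).item(), ranking[0].tolist())
```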