Mutual Query Network for Multi-Modal Product Image Segmentation
- URL: http://arxiv.org/abs/2306.14399v1
- Date: Mon, 26 Jun 2023 03:18:38 GMT
- Title: Mutual Query Network for Multi-Modal Product Image Segmentation
- Authors: Yun Guo, Wei Feng, Zheng Zhang, Xiancong Ren, Yaoyu Li, Jingjing Lv,
Xin Zhu, Zhangang Lin, Jingping Shao
- Abstract summary: We propose a mutual query network to segment products based on both visual and linguistic modalities.
To promote the research in this field, we also construct a Multi-Modal Product Segmentation dataset (MMPS).
The proposed method significantly outperforms the state-of-the-art methods on MMPS.
- Score: 13.192334066413837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Product image segmentation is vital in e-commerce. Most existing methods
extract the product image foreground only based on the visual modality, making
it difficult to distinguish irrelevant products. As product titles contain
abundant appearance information and provide complementary cues for product
image segmentation, we propose a mutual query network to segment products based
on both visual and linguistic modalities. First, we design a language query
vision module to obtain the response of language description in image areas,
thus aligning the visual and linguistic representations across modalities.
Then, a vision query language module utilizes the correlation between visual
and linguistic modalities to filter the product title and effectively suppress
the content irrelevant to the vision in the title. To promote the research in
this field, we also construct a Multi-Modal Product Segmentation dataset
(MMPS), which contains 30,000 images and corresponding titles. The proposed
method significantly outperforms the state-of-the-art methods on MMPS.
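The mutual querying described in the abstract can be pictured as cross-attention applied in both directions: title tokens query the visual feature map, and the visual features then query the vision-aware title before a lightweight head predicts the mask. The following is a minimal sketch of that idea, not the authors' implementation; module names, dimensions, and the 1x1-conv mask head are assumptions.

```python
# Minimal sketch (not the paper's code) of mutual querying between a visual
# feature map and product-title token embeddings via cross-attention.
import torch
import torch.nn as nn

class MutualQueryBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Language-query-vision: title tokens attend to image regions.
        self.lang_query_vision = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Vision-query-language: image regions attend to the vision-aware title,
        # suppressing words with no visual support.
        self.vision_query_language = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, C, H, W) visual features; txt_feat: (B, L, C) title embeddings.
        B, C, H, W = vis_feat.shape
        vis_seq = vis_feat.flatten(2).transpose(1, 2)                        # (B, H*W, C)
        # 1) Language queries vision: align each word with responsive image areas.
        lang_resp, _ = self.lang_query_vision(txt_feat, vis_seq, vis_seq)    # (B, L, C)
        # 2) Vision queries the filtered language and folds it back into the map.
        vis_resp, _ = self.vision_query_language(vis_seq, lang_resp, lang_resp)  # (B, H*W, C)
        fused = (vis_seq + vis_resp).transpose(1, 2).reshape(B, C, H, W)
        return self.mask_head(fused)                                         # (B, 1, H, W) mask logits

# Toy usage with random tensors standing in for backbone and text-encoder outputs.
logits = MutualQueryBlock()(torch.randn(2, 256, 32, 32), torch.randn(2, 20, 256))
```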
Related papers
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance
Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval [12.588713044749177]
Same-style products retrieval plays an important role in e-commerce platforms.
We propose a unified vision-language modeling method for e-commerce same-style products retrieval.
It is capable of cross-modal product-to-product retrieval, as well as style transfer and user-interactive search.
arXiv Detail & Related papers (2023-02-10T07:24:23Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce [9.46186546774799]
We propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images.
We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges.
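A framework of this kind typically rests on a CLIP-style symmetric contrastive objective over paired product images and titles. The sketch below shows that generic objective, assuming image and title embeddings are already produced by some pair of encoders; it is not e-CLIP's actual training recipe, and the temperature value is a placeholder.

```python
# Generic CLIP-style symmetric contrastive loss over matched image/title pairs.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) embeddings of matched image/title pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Matched pairs sit on the diagonal; the loss is symmetric in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```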
arXiv Detail & Related papers (2022-07-01T05:16:47Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
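Text-to-pixel alignment of the kind CRIS describes can be approximated by contrasting a sentence embedding against every pixel embedding and supervising the resulting similarity map with the ground-truth mask. The sketch below is a generic illustration of that idea, not the paper's exact loss; all shapes and the temperature are assumptions.

```python
# Generic text-to-pixel alignment: a sentence embedding scored against every pixel.
import torch
import torch.nn.functional as F

def text_to_pixel_alignment_loss(pixel_feat, text_vec, gt_mask, temperature=0.07):
    # pixel_feat: (B, C, H, W) per-pixel features; text_vec: (B, C); gt_mask: (B, H, W) in {0, 1}.
    pixel_feat = F.normalize(pixel_feat, dim=1)
    text_vec = F.normalize(text_vec, dim=1)
    sim = torch.einsum('bchw,bc->bhw', pixel_feat, text_vec) / temperature
    # Pixels belonging to the referred object should score high, all others low.
    return F.binary_cross_entropy_with_logits(sim, gt_mask.float())
```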
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- PAM: Understanding Product Images in Cross Product Category Attribute Extraction [40.332066960433245]
This work proposes a more inclusive framework that fully utilizes different modalities for attribute extraction.
Inspired by recent works in visual question answering, we use a transformer based sequence to sequence model to fuse representations of product text, Optical Character Recognition (OCR) tokens and visual objects detected in the product image.
The framework is further extended with the capability to extract attribute value across multiple product categories with a single model.
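The fusion step can be pictured as embedding the three input streams and concatenating them into one source sequence for an encoder-decoder transformer that generates the attribute value. The sketch below illustrates that pattern with placeholder dimensions, vocabulary size, and a shared token embedding; it is not the PAM model itself.

```python
# Illustrative fusion of title tokens, OCR tokens, and detected-object features
# into one source sequence for a transformer encoder-decoder.
import torch
import torch.nn as nn

class MultiModalSeq2Seq(nn.Module):
    def __init__(self, dim=256, vocab=30522):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)   # shared embedding for title and OCR tokens
        self.obj_proj = nn.Linear(2048, dim)      # project detected-object region features
        self.seq2seq = nn.Transformer(d_model=dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, title_ids, ocr_ids, obj_feats, tgt_ids):
        # Concatenate all modalities into a single source sequence.
        src = torch.cat([self.tok_emb(title_ids),
                         self.tok_emb(ocr_ids),
                         self.obj_proj(obj_feats)], dim=1)
        tgt = self.tok_emb(tgt_ids)
        return self.out(self.seq2seq(src, tgt))   # (B, T, vocab) attribute-value logits
```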
arXiv Detail & Related papers (2021-06-08T18:30:17Z)
- Cross-Modal Progressive Comprehension for Referring Segmentation [89.58118962086851]
We propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors.
For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression.
For video data, our CMPC-V module further exploits action words based on CMPC-I to highlight the correct entity matched with the action cues by temporal graph reasoning.
arXiv Detail & Related papers (2021-05-15T08:55:51Z)
- Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network [27.792054915363106]
We propose a cross-modal self-attention (CMSA) module to utilize fine details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates the self-attentive cross-modal features.
A cross-frame self-attention (CFSA) module effectively integrates temporal information in consecutive frames.
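A cross-modal self-attention step can be sketched by flattening the visual feature map, concatenating it with the word features, and running standard self-attention over the joint set, as in the generic illustration below; the gated multi-level fusion and cross-frame parts are omitted, and all shapes are assumptions rather than the paper's implementation.

```python
# Generic cross-modal self-attention over a joint set of pixel and word features.
import torch
import torch.nn as nn

def cross_modal_self_attention(vis_feat, txt_feat, num_heads=8):
    # vis_feat: (B, C, H, W); txt_feat: (B, L, C); C must be divisible by num_heads.
    B, C, H, W = vis_feat.shape
    vis_seq = vis_feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
    joint = torch.cat([vis_seq, txt_feat], dim=1)      # (B, H*W + L, C)
    attn = nn.MultiheadAttention(C, num_heads, batch_first=True)
    out, _ = attn(joint, joint, joint)                 # fine word details modulate each location
    return out[:, : H * W].transpose(1, 2).reshape(B, C, H, W)

refined = cross_modal_self_attention(torch.randn(2, 256, 32, 32), torch.randn(2, 20, 256))
```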
arXiv Detail & Related papers (2021-02-09T11:27:59Z)
- Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task which aims to generate captions with respect to the relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.