Large Scale Multimodal Classification Using an Ensemble of Transformer
Models and Co-Attention
- URL: http://arxiv.org/abs/2011.11735v1
- Date: Mon, 23 Nov 2020 21:22:54 GMT
- Title: Large Scale Multimodal Classification Using an Ensemble of Transformer
Models and Co-Attention
- Authors: Varnith Chordia, Vijay Kumar BG
- Abstract summary: We describe our methodology and results for the SIGIR eCom Rakuten Data Challenge.
We employ a dual attention technique to model image-text relationships using pretrained language and image embeddings.
- Score: 2.842794675894731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate and efficient product classification is significant for E-commerce
applications, as it enables various downstream tasks such as recommendation,
retrieval, and pricing. Items often contain both textual and visual information, and
utilizing both modalities usually outperforms classification with either
modality alone. In this paper, we describe our methodology and results for the SIGIR
eCom Rakuten Data Challenge. We employ a dual attention technique to model
image-text relationships using pretrained language and image embeddings. While
dual attention has been widely used for Visual Question Answering (VQA) tasks,
ours is the first attempt to apply the concept to multimodal classification.
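To illustrate the dual-attention idea from the abstract, the sketch below shows one way co-attention over pretrained text and image embeddings can feed a classification head. This is a minimal, hypothetical PyTorch sketch: the module name `CoAttentionClassifier`, the feature dimensions, the mean-pooling choice, and the class count are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical co-attention sketch; names, dimensions, and defaults are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionClassifier(nn.Module):
    """Attends text tokens over image regions and vice versa, then classifies."""

    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_classes=100):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # project language embeddings
        self.image_proj = nn.Linear(image_dim, hidden_dim)  # project image features
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, text_dim)  token embeddings from a pretrained language model
        # image_feats: (B, R, image_dim) region/grid features from a pretrained vision model
        t = self.text_proj(text_feats)    # (B, T, H)
        v = self.image_proj(image_feats)  # (B, R, H)

        # Affinity between every text token and every image region.
        affinity = torch.bmm(t, v.transpose(1, 2))  # (B, T, R)

        # Text -> image attention: weight image regions per token, then mean-pool.
        attn_t2v = F.softmax(affinity, dim=2)       # softmax over regions
        v_ctx = torch.bmm(attn_t2v, v).mean(dim=1)  # (B, H)

        # Image -> text attention: weight text tokens per region, then mean-pool.
        attn_v2t = F.softmax(affinity.transpose(1, 2), dim=2)  # softmax over tokens
        t_ctx = torch.bmm(attn_v2t, t).mean(dim=1)             # (B, H)

        # Fuse both attended representations and classify.
        return self.classifier(torch.cat([t_ctx, v_ctx], dim=1))  # (B, num_classes)


# Example usage with random features (batch of 4, 32 tokens, 36 image regions).
model = CoAttentionClassifier()
logits = model(torch.randn(4, 32, 768), torch.randn(4, 36, 2048))
print(logits.shape)  # torch.Size([4, 100])
```

Sharing a single affinity matrix between the two attention directions follows the common co-attention formulation used in VQA-style models.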
Related papers
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- Attention-based sequential recommendation system using multimodal data [8.110978727364397]
We propose an attention-based sequential recommendation method that employs multimodal data of items such as images, texts, and categories.
Experimental results on the Amazon datasets show that the proposed method outperforms conventional sequential recommendation systems.
arXiv Detail & Related papers (2024-05-28T08:41:05Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce [9.46186546774799]
We propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images.
We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges.
arXiv Detail & Related papers (2022-07-01T05:16:47Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two practical instance-level retrieval tasks.
We then train a more effective cross-modal model that can adaptively incorporate key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Extending CLIP for Category-to-image Retrieval in E-commerce [36.386210802938656]
E-commerce provides rich multimodal data that is barely leveraged in practice.
In practice, there is often a mismatch between a textual and a visual representation of a given category.
We introduce the task of category-to-image retrieval in e-commerce and propose a model for the task, CLIP-ITA.
arXiv Detail & Related papers (2021-12-21T15:33:23Z)
- Logically at the Factify 2022: Multimodal Fact Verification [2.8914815569249823]
This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022.
Two baseline approaches are proposed and explored including an ensemble model and a multi-modal attention network.
Our best model ranked first on the leaderboard, obtaining a weighted average F-measure of 0.77 on both the validation and test sets.
arXiv Detail & Related papers (2021-12-16T23:34:07Z)
- Pre-training Graph Transformer with Multimodal Side Information for Recommendation [82.4194024706817]
We propose a pre-training strategy to learn item representations by considering both item side information and their relationships.
We develop a novel sampling algorithm named MCNSampling to select contextual neighbors for each item.
The proposed Pre-trained Multimodal Graph Transformer (PMGT) learns item representations with two objectives: 1) graph structure reconstruction, and 2) masked node feature reconstruction.
arXiv Detail & Related papers (2020-10-23T10:30:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.