Large Scale Multimodal Classification Using an Ensemble of Transformer
Models and Co-Attention
- URL: http://arxiv.org/abs/2011.11735v1
- Date: Mon, 23 Nov 2020 21:22:54 GMT
- Title: Large Scale Multimodal Classification Using an Ensemble of Transformer
Models and Co-Attention
- Authors: Varnith Chordia, Vijay Kumar BG
- Abstract summary: We describe our methodology and results for the SIGIR eCom Rakuten Data Challenge.
We employ a dual attention technique to model image-text relationships using pretrained language and image embeddings.
- Score: 2.842794675894731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate and efficient product classification is significant for E-commerce
applications, as it enables various downstream tasks such as recommendation,
retrieval, and pricing. Items often contain textual and visual information, and
leveraging both modalities usually outperforms classification using either modality
alone. In this paper we describe our methodology and results for the SIGIR
eCom Rakuten Data Challenge. We employ a dual attention technique to model
image-text relationships using pretrained language and image embeddings. While
dual attention has been widely used for Visual Question Answering (VQA) tasks,
ours is the first attempt to apply the concept to multimodal classification.
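The abstract does not spell out the exact architecture, so the sketch below is only a minimal illustration of the dual (co-)attention idea: pretrained text token embeddings and image region embeddings attend to each other before a classification head. All dimensions, the pooling choice, and the module names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CoAttentionClassifier(nn.Module):
    """Illustrative co-attention over precomputed text and image embeddings.

    Text tokens attend to image regions and vice versa; the attended
    sequences are pooled, concatenated, and classified. Sizes are
    assumptions, not values from the paper.
    """

    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512,
                 num_heads=8, num_classes=3000):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Dual attention: each modality queries the other.
        self.text_to_image = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_emb, image_emb):
        # text_emb:  (batch, num_tokens, text_dim), e.g. from a pretrained language model
        # image_emb: (batch, num_regions, image_dim), e.g. from a pretrained image model
        t = self.text_proj(text_emb)
        v = self.image_proj(image_emb)
        # Text attends over image regions; image attends over text tokens.
        t_att, _ = self.text_to_image(query=t, key=v, value=v)
        v_att, _ = self.image_to_text(query=v, key=t, value=t)
        # Mean-pool each attended sequence and classify the fused vector.
        fused = torch.cat([t_att.mean(dim=1), v_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Example usage with random stand-in embeddings:
# logits = CoAttentionClassifier()(torch.randn(4, 32, 768), torch.randn(4, 36, 2048))
```

The ensemble referred to in the title is not shown here; it would combine the predictions of several such models, but the abstract does not describe how.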
Related papers
- Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks [0.8999666725996978]
We propose a novel RSSC framework that integrates text descriptions generated by large vision-language models (VLMs) as an auxiliary modality without incurring expensive manual annotation costs.
Experiments with both quantitative and qualitative evaluation across five RSSC datasets demonstrate that our framework consistently outperforms baseline models.
arXiv Detail & Related papers (2024-12-03T16:24:16Z)
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- Attention-based sequential recommendation system using multimodal data [8.110978727364397]
We propose an attention-based sequential recommendation method that employs multimodal data of items such as images, texts, and categories.
The experimental results obtained from the Amazon datasets show that the proposed method outperforms those of conventional sequential recommendation systems.
arXiv Detail & Related papers (2024-05-28T08:41:05Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce [9.46186546774799]
We propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images.
We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges.
arXiv Detail & Related papers (2022-07-01T05:16:47Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two realistic instance-level retrieval tasks.
We train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- Extending CLIP for Category-to-image Retrieval in E-commerce [36.386210802938656]
E-commerce provides rich multimodal data that is barely leveraged in practice.
In practice, there is often a mismatch between a textual and a visual representation of a given category.
We introduce the task of category-to-image retrieval in e-commerce and propose a model for the task, CLIP-ITA.
arXiv Detail & Related papers (2021-12-21T15:33:23Z)
- Logically at the Factify 2022: Multimodal Fact Verification [2.8914815569249823]
This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022.
Two baseline approaches are proposed and explored, including an ensemble model and a multi-modal attention network.
Our best model ranked first on the leaderboard, obtaining a weighted average F-measure of 0.77 on both the validation and test sets.
arXiv Detail & Related papers (2021-12-16T23:34:07Z)
- Pre-training Graph Transformer with Multimodal Side Information for Recommendation [82.4194024706817]
We propose a pre-training strategy to learn item representations by considering both item side information and their relationships.
We develop a novel sampling algorithm named MCNSampling to select contextual neighbors for each item.
The proposed Pre-trained Multimodal Graph Transformer (PMGT) learns item representations with two objectives: 1) graph structure reconstruction, and 2) masked node feature reconstruction.
arXiv Detail & Related papers (2020-10-23T10:30:24Z)
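The two PMGT objectives quoted above can be sketched as a combined loss. The snippet below is a rough, hypothetical illustration (the encoder, masking scheme, and equal loss weighting are assumptions), not the published PMGT implementation.

```python
import torch.nn.functional as F

def pmgt_style_losses(node_emb, node_feat, adj, mask, feature_decoder):
    """Illustrative combination of the two objectives named above.

    node_emb        : (N, d) node representations from some graph transformer
    node_feat       : (N, f) original multimodal node features
    adj             : (N, N) float 0/1 adjacency matrix (structure target)
    mask            : (N,)   boolean mask of nodes whose features were hidden
    feature_decoder : module mapping d-dim embeddings back to f-dim features
    """
    # 1) Graph structure reconstruction: predict edges from embedding similarity.
    edge_logits = node_emb @ node_emb.t()
    structure_loss = F.binary_cross_entropy_with_logits(edge_logits, adj)

    # 2) Masked node feature reconstruction: regress the hidden features.
    recon = feature_decoder(node_emb[mask])
    feature_loss = F.mse_loss(recon, node_feat[mask])

    # Equal weighting is an assumption; the paper may balance the terms differently.
    return structure_loss + feature_loss
```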
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.