Multi-modal Extreme Classification
- URL: http://arxiv.org/abs/2309.04961v1
- Date: Sun, 10 Sep 2023 08:23:52 GMT
- Title: Multi-modal Extreme Classification
- Authors: Anshul Mittal, Kunal Dahiya, Shreya Malani, Janani Ramaswamy, Seba
Kuruvilla, Jitendra Ajmera, Keng-hao Chang, Sumeet Agarwal, Purushottam Kar,
Manik Varma
- Abstract summary: This paper develops the MUFIN technique for extreme classification (XC) tasks with millions of labels.
MUFIN bridges this gap by reformulating multi-modal categorization as an XC problem with several million labels.
MUFIN offers at least 3% higher accuracy than leading text-based, image-based and multi-modal techniques.
- Score: 14.574342454143023
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper develops the MUFIN technique for extreme classification (XC) tasks
with millions of labels, where datapoints and labels are endowed with visual and
textual descriptors. Applications of MUFIN to product-to-product recommendation
and bid-query prediction over several million products are presented.
Contemporary multi-modal methods frequently rely on purely embedding-based
approaches. XC methods, on the other hand, use classifier architectures that
offer superior accuracy to embedding-only methods but focus mostly on
text-based categorization tasks. MUFIN bridges this gap by reformulating
multi-modal categorization as an XC problem with several million labels.
This presents twin challenges: developing multi-modal architectures whose
embeddings are expressive enough to allow accurate categorization over
millions of labels, and designing training and inference routines that scale
logarithmically in the number of labels. MUFIN develops an architecture based
on cross-modal attention and trains it in a modular fashion using pre-training
together with positive and negative mining. A novel product-to-product
recommendation dataset, MM-AmazonTitles-300K, containing over 300K products,
was curated from publicly available amazon.com listings, with each product
endowed with a title and multiple images. On all datasets, MUFIN offered at
least 3% higher accuracy than leading text-based, image-based, and multi-modal
techniques. Code for MUFIN is available at https://github.com/Extreme-classification/MUFIN.
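The abstract attributes MUFIN's expressiveness to an architecture based on cross-modal attention. As a rough illustration of what such a fusion block can look like, here is a minimal PyTorch sketch in which text tokens attend over image-region features; the class name, dimensions, and mean-pooling choice are assumptions for illustration and are not taken from the MUFIN codebase.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal fusion block (hypothetical, not MUFIN's):
    text tokens attend over image-region features, producing one fused
    embedding per datapoint for downstream extreme classification."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor,
                image_regions: torch.Tensor) -> torch.Tensor:
        # text_tokens:   (batch, n_tokens, dim), e.g. from a text encoder
        # image_regions: (batch, n_regions, dim), e.g. from a ViT/CNN backbone
        attended, _ = self.attn(query=text_tokens,
                                key=image_regions,
                                value=image_regions)
        fused = self.norm(text_tokens + attended)  # residual + layer norm
        # Mean-pool fused tokens into a single embedding per datapoint.
        return fused.mean(dim=1)

# Usage: embed a batch of 4 products, each with 16 text tokens and
# 9 image regions, all projected into a shared 512-d space.
fuser = CrossModalAttention(dim=512, num_heads=8)
text = torch.randn(4, 16, 512)
imgs = torch.randn(4, 9, 512)
emb = fuser(text, imgs)  # shape: (4, 512)
print(emb.shape)
```

Stacking such blocks (and the symmetric direction where image regions attend over text) yields a joint embedding that a bank of per-label classifiers can then score.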
Related papers
- Exploring Fine-grained Retail Product Discrimination with Zero-shot Object Classification Using Vision-Language Models [50.370043676415875]
In smart retail applications, the large number of products and their frequent turnover necessitate reliable zero-shot object classification methods.
We introduce the MIMEX dataset, comprising 28 distinct product categories.
We benchmark the zero-shot object classification performance of state-of-the-art vision-language models (VLMs) on the proposed MIMEX dataset.
arXiv Detail & Related papers (2024-09-23T12:28:40Z)
- UniDEC: Unified Dual Encoder and Classifier Training for Extreme Multi-Label Classification [42.36546066941635]
Extreme Multi-label Classification (XMC) involves predicting a subset of relevant labels from an extremely large label space.
This work proposes UniDEC, a novel end-to-end trainable framework that trains the dual encoder and classifier together.
arXiv Detail & Related papers (2024-05-04T17:27:51Z)
- Learning label-label correlations in Extreme Multi-label Classification via Label Features [44.00852282861121]
Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign an input with a subset of most relevant labels from millions of label choices.
Short-text XMC with label features has found numerous applications in areas such as query-to-ad-phrase matching in search ads, title-based product recommendation, and prediction of related searches.
We propose Gandalf, a novel approach which makes use of a label co-occurrence graph to leverage label features as additional data points to supplement the training distribution.
arXiv Detail & Related papers (2024-05-03T21:18:43Z)
- Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels [66.54691023795097]
We propose a prompt-based approach, i.e., the Multimodal Prompt Learning framework, to generate titles for novel products with limited labels.
We build a set of multimodal prompts from different modalities to preserve the corresponding characteristics and writing styles of novel products.
With fully labelled training data, our method achieves state-of-the-art results.
arXiv Detail & Related papers (2023-07-05T00:40:40Z)
- Reliable Representations Learning for Incomplete Multi-View Partial Multi-Label Classification [78.15629210659516]
In this paper, we propose an incomplete multi-view partial multi-label classification network named RANK.
We go beyond the view-level weights inherent in existing methods and propose a quality-aware sub-network that dynamically assigns quality scores to each view of each sample.
Our model not only handles complete multi-view multi-label datasets but also works on datasets with missing instances and labels.
arXiv Detail & Related papers (2023-03-30T03:09:25Z)
- M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks [94.80043324367858]
We contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs.
M5Product contains rich information of multiple modalities including image, text, table, video and audio.
arXiv Detail & Related papers (2021-09-09T13:50:22Z)
- DECAF: Deep Extreme Classification with Label Features [9.768907751312396]
Extreme multi-label classification (XML) involves tagging a data point with its most relevant subset of labels from an extremely large label set.
Leading XML algorithms scale to millions of labels, but they largely ignore label metadata such as textual descriptions of the labels.
This paper develops the DECAF algorithm that addresses these challenges by learning models enriched by label metadata.
arXiv Detail & Related papers (2021-08-01T05:36:05Z)
- ECLARE: Extreme Classification with Label Graph Correlations [13.429436351837653]
This paper presents ECLARE, a scalable deep learning architecture that incorporates not only label text, but also label correlations, to offer accurate real-time predictions within a few milliseconds.
ECLARE offers predictions that are 2 to 14% more accurate on both publicly available benchmark datasets and proprietary datasets for a related-products recommendation task sourced from the Bing search engine.
arXiv Detail & Related papers (2021-07-31T15:13:13Z)
- Label Disentanglement in Partition-based Extreme Multilabel Classification [111.25321342479491]
We show that the label assignment problem in partition-based XMC can be formulated as an optimization problem.
We show that our method can successfully disentangle multi-modal labels, leading to state-of-the-art (SOTA) results on four XMC benchmarks.
arXiv Detail & Related papers (2021-06-24T03:24:18Z)
- An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels [49.036212158261215]
Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications.
Current state-of-the-art LMTC models employ Label-Wise Attention Networks (LWANs).
We show that hierarchical methods based on Probabilistic Label Trees (PLTs) outperform LWANs; a minimal sketch of tree-based label shortlisting is given after this list.
We propose a new state-of-the-art method which combines BERT with LWANs.
arXiv Detail & Related papers (2020-10-04T18:55:47Z)
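Both the MUFIN abstract's requirement that inference scale logarithmically in the number of labels and the PLT result above rest on tree-structured shortlisting of labels. Below is a minimal, self-contained sketch of that idea, not MUFIN's or any listed paper's actual routine: the node classifiers are random placeholders standing in for trained weights, and the halving-based tree construction is a stand-in for learned label clustering.

```python
import numpy as np

class Node:
    """One node of a balanced binary label tree. Internal nodes route;
    leaves store a small bucket of label ids."""
    def __init__(self, dim, rng, labels=None):
        self.w = rng.standard_normal(dim)  # placeholder "node classifier"
        self.labels = labels               # non-None only at leaves
        self.left = self.right = None

def build_tree(labels, dim, rng, leaf_size=4):
    # Split the label list in half recursively until buckets are small;
    # real systems cluster labels (e.g. balanced 2-means) instead.
    if len(labels) <= leaf_size:
        return Node(dim, rng, labels=labels)
    node = Node(dim, rng)
    mid = len(labels) // 2
    node.left = build_tree(labels[:mid], dim, rng, leaf_size)
    node.right = build_tree(labels[mid:], dim, rng, leaf_size)
    return node

def shortlist(root, x, beam=4):
    """Beam search from the root: keep the `beam` best-scoring nodes per
    level, so only O(beam * log L) nodes are scored instead of all L."""
    frontier = [(root.w @ x, root)]
    candidates = []
    while frontier:
        frontier.sort(key=lambda t: -t[0])     # best-first
        next_frontier = []
        for score, node in frontier[:beam]:
            if node.labels is not None:        # leaf: emit its bucket
                candidates.extend(node.labels)
            else:
                next_frontier.append((node.left.w @ x, node.left))
                next_frontier.append((node.right.w @ x, node.right))
        frontier = next_frontier
    return candidates

rng = np.random.default_rng(0)
root = build_tree(list(range(10_000)), dim=32, rng=rng)  # 10K labels
x = rng.standard_normal(32)                  # a datapoint embedding
print(len(shortlist(root, x, beam=4)))       # tiny candidate set, not 10K
```

Per-label classifiers then need to score only the shortlisted candidates, which is what keeps inference tractable at the million-label scale the abstract describes.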