MuMIC -- Multimodal Embedding for Multi-label Image Classification with
Tempered Sigmoid
- URL: http://arxiv.org/abs/2211.05232v1
- Date: Wed, 2 Nov 2022 17:29:35 GMT
- Title: MuMIC -- Multimodal Embedding for Multi-label Image Classification with
Tempered Sigmoid
- Authors: Fengjun Wang, Sarai Mizrachi, Moran Beladev, Guy Nadav, Gil Amsalem,
Karen Lastmann Assaraf, Hadas Harush Boker
- Abstract summary: Multimodal learning approaches have recently achieved outstanding results in image representation and single-label image classification.
We propose the Multimodal Multi-label Image Classification (MuMIC) framework, which utilizes a hardness-aware tempered sigmoid based Binary Cross Entropy loss function.
MuMIC is capable of providing high classification performance, handling real-world noisy data, supporting zero-shot predictions, and producing domain-specific image embeddings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-label image classification is a foundational topic in various domains.
Multimodal learning approaches have recently achieved outstanding results in
image representation and single-label image classification. For instance,
Contrastive Language-Image Pretraining (CLIP) demonstrates impressive
image-text representation learning abilities and is robust to natural
distribution shifts. This success inspires us to leverage multimodal learning
for multi-label classification tasks, and benefit from contrastively learnt
pretrained models. We propose the Multimodal Multi-label Image Classification
(MuMIC) framework, which utilizes a hardness-aware tempered sigmoid based
Binary Cross Entropy loss function, thereby enabling optimization of
multi-label objectives and transfer learning on CLIP. MuMIC is capable of
providing high classification performance, handling real-world noisy data,
supporting zero-shot predictions, and producing domain-specific image
embeddings. In this study, a total of 120 image classes are defined, and more
than 140K positive annotations are collected on approximately 60K Booking.com
images. The final MuMIC model is deployed on Booking.com Content Intelligence
Platform, and it outperforms other state-of-the-art models with 85.6% GAP@10
and 83.8% GAP on all 120 classes, as well as a 90.1% macro mAP score across 32
majority classes. We summarize the modeling choices which are extensively
tested through ablation studies. To the best of our knowledge, we are the first
to adapt contrastively learnt multimodal pretraining for real-world multi-label
image classification problems, and the innovation can be transferred to other
domains.
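The abstract's central ingredient is a tempered sigmoid inside the Binary Cross Entropy loss. The exact MuMIC formulation (including how the hardness-aware temperature is set) is not given here, so the following is a minimal sketch under assumptions: a single fixed `temperature` applied to the logits before the sigmoid, with the loss averaged over all (sample, class) pairs; all function names are illustrative.

```python
import numpy as np

def tempered_sigmoid(x, temperature=1.0):
    # Sigmoid with a temperature parameter: temperatures above 1 soften
    # the curve (less saturated probabilities), below 1 sharpen it.
    return 1.0 / (1.0 + np.exp(-x / temperature))

def tempered_bce(logits, targets, temperature=2.0, eps=1e-7):
    # Binary cross-entropy on tempered-sigmoid probabilities, averaged
    # over every (sample, class) pair, as in multi-label training.
    probs = np.clip(tempered_sigmoid(logits, temperature), eps, 1.0 - eps)
    return -np.mean(targets * np.log(probs)
                    + (1.0 - targets) * np.log(1.0 - probs))

# Toy batch: 2 samples, 2 classes, multi-hot targets.
logits = np.array([[2.0, -1.5], [0.5, 3.0]])
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = tempered_bce(logits, targets, temperature=2.0)
```

Softening the probabilities this way reduces the gradient contribution of confidently scored "hard" or noisy examples, which is plausibly why the paper pairs it with real-world noisy annotations.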
Related papers
- Google is all you need: Semi-Supervised Transfer Learning Strategy For Light Multimodal Multi-Task Classification Model [1.8160945635344523]
This study introduces a robust multi-label classification system designed to assign multiple labels to a single image.
We propose a multi-modal classifier that merges advanced image recognition algorithms with Natural Language Processing (NLP) models.
Our proposed classification model combines Convolutional Neural Networks (CNN) for image processing with NLP techniques for analyzing textual description.
arXiv Detail & Related papers (2025-01-03T03:11:17Z)
- Diverse and Tailored Image Generation for Zero-shot Multi-label Classification [3.354528906571718]
Zero-shot multi-label classification has garnered considerable attention for its capacity to make predictions on unseen labels without human annotations.
Prevailing approaches often use seen classes as imperfect proxies for unseen ones, resulting in suboptimal performance.
We propose an innovative solution: generating synthetic data to construct a training set explicitly tailored for proxyless training on unseen labels.
arXiv Detail & Related papers (2024-04-04T01:34:36Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that, for large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- Self-Supervised Open-Ended Classification with Small Visual Language Models [60.23212389067007]
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe.
arXiv Detail & Related papers (2023-09-30T21:41:21Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Diversity is Definitely Needed: Improving Model-Agnostic Zero-shot Classification via Stable Diffusion [22.237426507711362]
Model-Agnostic Zero-Shot Classification (MA-ZSC) refers to training non-specific classification architectures to classify real images without using any real images during training.
Recent research has demonstrated that generating synthetic training images using diffusion models provides a potential solution to address MA-ZSC.
We propose modifications to the text-to-image generation process using a pre-trained diffusion model to enhance diversity.
arXiv Detail & Related papers (2023-02-07T07:13:53Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular practice to cope with self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- Multi-Label Image Classification with Contrastive Learning [57.47567461616912]
We show that a direct application of contrastive learning can hardly improve performance in multi-label cases.
We propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting.
arXiv Detail & Related papers (2021-07-24T15:00:47Z)
- Personalizing Pre-trained Models [23.145974171912414]
We consider how upstream pretrained models can be leveraged for downstream few-shot, multilabel, and continual learning tasks.
Our model CLIPPER (CLIP PERsonalized) uses image representations from CLIP, a large-scale image representation learning model trained using weak natural language supervision.
arXiv Detail & Related papers (2021-06-02T22:58:47Z)
- Semantic Diversity Learning for Zero-Shot Multi-label Classification [14.480713752871523]
This study introduces an end-to-end model training for multi-label zero-shot learning.
We propose to use an embedding matrix having principal embedding vectors trained using a tailored loss function.
In addition, during training, we suggest up-weighting, in the loss function, image samples that present higher semantic diversity, to encourage diversity of the embedding matrix.
arXiv Detail & Related papers (2021-05-12T19:39:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.