CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
- URL: http://arxiv.org/abs/2112.03562v2
- Date: Thu, 9 Dec 2021 06:57:24 GMT
- Title: CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
- Authors: Huidong Liu, Shaoyuan Xu, Jinmiao Fu, Yang Liu, Ning Xie, Chien-Chih
Wang, Bryan Wang, Yi Sun
- Abstract summary: We propose the Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP) framework.
CMA-CLIP unifies two types of cross-modality attentions, sequence-wise attention and modality-wise attention, to effectively fuse information from image and text pairs.
We conduct experiments on a Major Retail Website Product Attribute (MRWPA) dataset and two public datasets, Food101 and Fashion-Gen.
- Score: 18.78457628409226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern Web systems such as social media and e-commerce contain rich contents
expressed in images and text. Leveraging information from multi-modalities can
improve the performance of machine learning tasks such as classification and
recommendation. In this paper, we propose the Cross-Modality Attention
Contrastive Language-Image Pre-training (CMA-CLIP), a new framework which
unifies two types of cross-modality attentions, sequence-wise attention and
modality-wise attention, to effectively fuse information from image and text
pairs. The sequence-wise attention enables the framework to capture the
fine-grained relationship between image patches and text tokens, while the
modality-wise attention weighs each modality by its relevance to the downstream
tasks. In addition, by adding task-specific modality-wise attentions and
multilayer perceptrons, our proposed framework is capable of performing
multi-task classification with multi-modalities.
We conduct experiments on a Major Retail Website Product Attribute (MRWPA)
dataset and two public datasets, Food101 and Fashion-Gen. The results show that
CMA-CLIP outperforms the pre-trained and fine-tuned CLIP by an average of 11.9%
in recall at the same level of precision on the MRWPA dataset for multi-task
classification. It also surpasses the state-of-the-art method on Fashion-Gen
Dataset by 5.5% in accuracy and achieves competitive performance on Food101
Dataset. Through detailed ablation studies, we further demonstrate the
effectiveness of both cross-modality attention modules and our method's
robustness against noise in image and text inputs, which is a common challenge
in practice.
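To make the fusion architecture concrete, below is a minimal PyTorch sketch of the two cross-modality attentions and the per-task heads described above. It is an illustrative approximation under assumed module choices (a joint transformer layer standing in for sequence-wise attention, softmax-weighted pooling for modality-wise attention, and hypothetical dimensions and task count), not the authors' implementation.

    import torch
    import torch.nn as nn


    class CrossModalityFusion(nn.Module):
        """Illustrative sketch of CMA-CLIP-style fusion (not the official code).

        Assumes image patch embeddings and text token embeddings have already
        been produced by a CLIP-style backbone and projected to width `d`.
        """

        def __init__(self, d=512, n_heads=8, n_tasks=3, n_classes=10):
            super().__init__()
            # Sequence-wise attention: image patches and text tokens attend to
            # each other as one joint sequence (fine-grained patch-token relations).
            self.seq_attn = nn.TransformerEncoderLayer(
                d_model=d, nhead=n_heads, batch_first=True)
            # Modality-wise attention: per-task scalar scores weighing how relevant
            # each modality (image vs. text) is to that downstream task.
            self.mod_score = nn.ModuleList([nn.Linear(d, 1) for _ in range(n_tasks)])
            # One multilayer perceptron head per task (multi-task classification).
            self.heads = nn.ModuleList([
                nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, n_classes))
                for _ in range(n_tasks)])

        def forward(self, img_patches, txt_tokens):
            # img_patches: (B, P, d); txt_tokens: (B, T, d)
            joint = torch.cat([img_patches, txt_tokens], dim=1)    # (B, P+T, d)
            fused = self.seq_attn(joint)                           # sequence-wise attention
            img_vec = fused[:, :img_patches.size(1)].mean(dim=1)   # pooled image vector
            txt_vec = fused[:, img_patches.size(1):].mean(dim=1)   # pooled text vector
            modal = torch.stack([img_vec, txt_vec], dim=1)         # (B, 2, d)

            logits = []
            for score, head in zip(self.mod_score, self.heads):
                w = torch.softmax(score(modal), dim=1)             # (B, 2, 1) modality weights
                task_vec = (w * modal).sum(dim=1)                  # weighted modality fusion
                logits.append(head(task_vec))                      # (B, n_classes)
            return logits                                          # one tensor per task


    # Hypothetical usage: batch of 4, 49 image patches, 16 text tokens, width 512.
    model = CrossModalityFusion()
    outputs = model(torch.randn(4, 49, 512), torch.randn(4, 16, 512))

In this sketch each task re-weights the pooled image and text vectors independently, mirroring the abstract's description of task-specific modality-wise attentions feeding task-specific multilayer perceptrons; the paper's exact formulation may differ.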
Related papers
- Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification
with Cross-Modal Retrieval [29.838375158101027]
Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability.
We propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble.
X-MoRe demonstrates robust performance across a diverse set of tasks without the need for additional training.
arXiv Detail & Related papers (2023-08-29T13:02:35Z) - Composed Image Retrieval using Contrastive Learning and Task-oriented
CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z) - Multi-interactive Feature Learning and a Full-time Multi-modality
Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Probing Visual-Audio Representation for Video Highlight Detection via
Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations are cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z) - Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product
Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z) - Multimodal Contrastive Training for Visual Representation Learning [45.94662252627284]
We develop an approach to learning visual representations that embraces multimodal data.
Our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously.
By including multimodal training in a unified framework, our method can learn more powerful and generic visual features.
arXiv Detail & Related papers (2021-04-26T19:23:36Z) - Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
arXiv Detail & Related papers (2020-04-10T06:31:30Z)