TextAug: Test time Text Augmentation for Multimodal Person
Re-identification
- URL: http://arxiv.org/abs/2312.01605v1
- Date: Mon, 4 Dec 2023 03:38:04 GMT
- Title: TextAug: Test time Text Augmentation for Multimodal Person
Re-identification
- Authors: Mulham Fawakherji, Eduard Vazquez, Pasquale Giampa, Binod Bhattarai
- Abstract summary: The bottleneck for multimodal deep learning is the need for a large volume of multimodal training examples.
Data augmentation techniques such as cropping, flipping, rotation, etc. are often employed in the image domain to improve the generalization of deep learning models.
In this study, we investigate the effectiveness of two computer vision data augmentation techniques: cutout and cutmix, for text augmentation in multi-modal person re-identification.
- Score: 8.557492202759711
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multimodal Person Reidentification is gaining popularity in the research
community due to its effectiveness compared with its unimodal counterparts.
However, the bottleneck for multimodal deep learning is the need
for a large volume of multimodal training examples. Data augmentation
techniques such as cropping, flipping, rotation, etc. are often employed in the
image domain to improve the generalization of deep learning models. Augmenting
modalities other than images, such as text, is challenging and requires
significant computational resources and external data sources. In this study,
we investigate the effectiveness of two computer vision data augmentation
techniques: cutout and cutmix, for text augmentation in multi-modal person
re-identification. Our approach merges these two augmentation strategies into
one strategy called CutMixOut which involves randomly removing words or
sub-phrases from a sentence (Cutout) and blending parts of two or more
sentences to create diverse examples (CutMix) with a certain probability
assigned to each operation. This augmentation was implemented at inference time
without any prior training. Our results demonstrate that the proposed technique
is simple and effective in improving the performance on multiple multimodal
person re-identification benchmarks.
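Since no code accompanies this summary, here is a minimal pure-Python sketch of what a CutMixOut-style operation could look like; the function names, word-level granularity, and default probabilities are illustrative assumptions rather than the authors' implementation.

```python
import random

def text_cutout(tokens, drop_prob=0.15):
    # Cutout: randomly remove words or sub-phrases from a sentence.
    kept = [t for t in tokens if random.random() > drop_prob]
    return kept or tokens  # guard against deleting every token

def text_cutmix(tokens_a, tokens_b):
    # CutMix: splice a random span of one sentence into another.
    span = random.randint(1, max(1, len(tokens_b) // 2))
    start = random.randint(0, len(tokens_b) - span)
    pos = random.randint(0, len(tokens_a))
    return tokens_a[:pos] + tokens_b[start:start + span] + tokens_a[pos:]

def cutmixout(sentence, partner, p_cutout=0.5):
    # Apply one of the two operations, chosen with the assigned probability.
    tokens = sentence.split()
    if random.random() < p_cutout:
        tokens = text_cutout(tokens)
    else:
        tokens = text_cutmix(tokens, partner.split())
    return " ".join(tokens)

# Inference-time use: augment a text description before encoding it.
print(cutmixout("a man in a red jacket and black trousers",
                "a woman in a blue coat carrying a backpack"))
```

At test time one might encode several augmented copies of the same description and aggregate their embeddings, though the aggregation scheme is not specified in this summary.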
Related papers
- Turbo your multi-modal classification with contrastive learning [17.983460380784337]
In this paper, we propose a novel contrastive learning strategy, called Turbo, to promote multi-modal understanding.
Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to get two different representations for each modality.
With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training.
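As a rough illustration of the dropout-twice idea in the Turbo summary above, the sketch below assumes generic PyTorch encoders that contain dropout layers; the InfoNCE form and temperature are common defaults, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Standard InfoNCE between two batches of embeddings, matched by index.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def turbo_losses(image_encoder, text_encoder, images, texts):
    # Two forward passes with dropout active give two views per modality.
    image_encoder.train(); text_encoder.train()  # keep dropout masks random
    v1, v2 = image_encoder(images), image_encoder(images)
    t1, t2 = text_encoder(texts), text_encoder(texts)
    in_modal = info_nce(v1, v2) + info_nce(t1, t2)
    cross_modal = info_nce(v1, t1) + info_nce(v2, t2)
    return in_modal + cross_modal
```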
arXiv Detail & Related papers (2024-09-14T03:15:34Z)
- Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition [6.995226697189459]
We employ a multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data.
Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks.
We release our pre-trained models as well as source code publicly.
arXiv Detail & Related papers (2024-04-16T20:51:36Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
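To make the retrieval setting concrete, a minimal dual-encoder ranking sketch follows; the text/image encoders and cosine-similarity scoring are generic assumptions, not the paper's dual Transformer or its proximity data generation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_gallery(text_encoder, image_encoder, query_text, gallery_images):
    # Encode the query and every gallery image into a shared space.
    q = F.normalize(text_encoder(query_text), dim=-1)       # (1, D)
    g = F.normalize(image_encoder(gallery_images), dim=-1)  # (N, D)
    scores = (q @ g.t()).squeeze(0)                         # cosine similarity
    return scores.argsort(descending=True)                  # best match first
```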
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
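A toy version of such word-conditional soft masking might look as follows; the dot-product attention, the max over words, and the damping strength are illustrative assumptions.

```python
import torch

def soft_mask_regions(region_feats, word_feats, strength=0.5):
    # region_feats: (R, D) image-region features; word_feats: (W, D) word features.
    attn = torch.softmax(region_feats @ word_feats.t(), dim=0)  # (R, W) per-word attention
    relevance = attn.max(dim=1).values                          # peak relevance per region
    mask = 1.0 - strength * relevance / relevance.max()         # soft, not hard, masking
    return region_feats * mask.unsqueeze(-1)                    # down-weight relevant regions
```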
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
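The prompt-as-feature-bank idea could be sketched as below; the backbone depth, prompt count, and substitution rule are placeholders rather than PTUnifier's actual design.

```python
import torch
import torch.nn as nn

class PromptUnifiedEncoder(nn.Module):
    # When a modality is absent, substitute learnable prompts for its tokens,
    # so fusion-style and dual-style inputs share one input format.
    def __init__(self, dim=256, n_prompts=8):
        super().__init__()
        self.visual_prompts = nn.Parameter(torch.randn(n_prompts, dim))
        self.textual_prompts = nn.Parameter(torch.randn(n_prompts, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image_tokens=None, text_tokens=None):
        b = (image_tokens if image_tokens is not None else text_tokens).size(0)
        if image_tokens is None:  # text-only input borrows visual prompts
            image_tokens = self.visual_prompts.unsqueeze(0).expand(b, -1, -1)
        if text_tokens is None:   # image-only input borrows textual prompts
            text_tokens = self.textual_prompts.unsqueeze(0).expand(b, -1, -1)
        return self.backbone(torch.cat([image_tokens, text_tokens], dim=1))
```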
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z)
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and can easily be generalized to other tasks and unseen data with a small performance drop and almost the same number of parameters.
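A compact sketch of the matrix-plus-autoencoder setup described above; the layer sizes and the (time, feature) layout are chosen arbitrarily for illustration.

```python
import torch
import torch.nn as nn

class MultimodalConvAE(nn.Module):
    # Autoencoder over a (time, feature) matrix built by stacking the
    # word-aligned features of each modality along the feature axis.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):    # x: (batch, 1, time, feature)
        z = self.encoder(x)  # compact multimodal embedding
        return self.decoder(z), z

# e.g. 20 aligned words; text, audio and vision features concatenated to width 64
recon, embedding = MultimodalConvAE()(torch.randn(8, 1, 20, 64))
```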
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
- Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment [99.29153138760417]
Cross-modal generalization is a learning paradigm to train a model that can quickly perform new tasks in a target modality.
We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities?
Our solution is based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data.
arXiv Detail & Related papers (2020-12-04T19:27:26Z)
- Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
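The filtering role of cross-attention can be sketched roughly as follows; treating text as the stronger modality, and the head and dimension choices, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossAttentionFilter(nn.Module):
    # Let the stronger modality attend over the weaker one so that
    # uninformative or misleading components receive low attention weight.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, strong_tokens, weak_tokens):
        # strong_tokens: (B, Ns, D) e.g. text; weak_tokens: (B, Nw, D) e.g. image regions
        filtered, weights = self.attn(strong_tokens, weak_tokens, weak_tokens)
        return strong_tokens + filtered, weights  # residual fusion

fused, w = CrossAttentionFilter()(torch.randn(2, 12, 256), torch.randn(2, 49, 256))
```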
arXiv Detail & Related papers (2020-04-10T06:31:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.