Weakly-Supervised Text-driven Contrastive Learning for Facial Behavior
Understanding
- URL: http://arxiv.org/abs/2304.00058v2
- Date: Fri, 25 Aug 2023 14:31:35 GMT
- Title: Weakly-Supervised Text-driven Contrastive Learning for Facial Behavior
Understanding
- Authors: Xiang Zhang, Taoyue Wang, Xiaotian Li, Huiyuan Yang and Lijun Yin
- Abstract summary: We introduce a two-stage Contrastive Learning with Text-Embedded framework for Facial behavior understanding (CLEF).
The first stage is a weakly-supervised contrastive learning method that learns representations from positive-negative pairs constructed using coarse-grained activity information.
The second stage trains the recognition of facial expressions or facial action units by maximizing the similarity between each image and the text of its corresponding label name.
- Score: 12.509298933267221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning has shown promising potential for learning robust
representations by utilizing unlabeled data. However, constructing effective
positive-negative pairs for contrastive learning on facial behavior datasets
remains challenging. This is because such pairs inevitably encode the
subject-ID information, and the randomly constructed pairs may push similar
facial images away due to the limited number of subjects in facial behavior
datasets. To address this issue, we propose to utilize activity descriptions, a
form of coarse-grained information provided in some datasets, which conveys
high-level semantic information about the image sequences but is often
neglected in previous studies. More specifically, we introduce a two-stage
Contrastive Learning with Text-Embedded framework for Facial behavior
understanding (CLEF). The first stage is a weakly-supervised contrastive
learning method that learns representations from positive-negative pairs
constructed using coarse-grained activity information. The second stage trains
the recognition of facial expressions or facial action units by maximizing the
similarity between each image and the text of its corresponding label name.
The proposed CLEF achieves state-of-the-art performance on three in-the-lab
datasets for AU recognition and three in-the-wild datasets for facial
expression recognition.
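Below is a minimal PyTorch-style sketch of the two training objectives described in the abstract, not the authors' released code: stage one treats frames whose sequences share the same coarse-grained activity description as positives in a supervised-contrastive-style loss, and stage two aligns image embeddings with text embeddings of the class label names, CLIP-style. Function names, tensor shapes, and the temperature value are illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch) of the two CLEF objectives described in the
# abstract. Names, shapes, and the temperature are illustrative; this is not
# the authors' implementation.
import torch
import torch.nn.functional as F


def stage1_weakly_supervised_contrastive(features, activity_ids, tau=0.07):
    # Stage 1: frames whose sequences share the same coarse-grained activity
    # description are treated as positives; all other frames act as negatives.
    # features:     (N, D) embeddings from the image encoder
    # activity_ids: (N,)   integer id of each frame's activity description
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / tau                                   # (N, N) similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (activity_ids[:, None] == activity_ids[None, :]) & ~self_mask

    # log-softmax over all non-self samples, averaged over each anchor's positives
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True
    )
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()


def stage2_text_label_alignment(image_emb, label_text_emb, labels, tau=0.07):
    # Stage 2: maximize similarity between each image embedding and the text
    # embedding of its label name (expression or AU), CLIP-style.
    # image_emb:      (N, D) image embeddings
    # label_text_emb: (C, D) text-encoder embeddings of the C label names
    # labels:         (N,)   ground-truth class indices
    img = F.normalize(image_emb, dim=1)
    txt = F.normalize(label_text_emb, dim=1)
    logits = img @ txt.t() / tau                            # (N, C) image-to-text similarities
    return F.cross_entropy(logits, labels)
```

Per the abstract, the two stages run sequentially: the first learns the image representation from the weak activity labels, and the second then trains expression/AU recognition by aligning that representation with the label-name text embeddings.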
Related papers
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining [2.9010546489056415]
Vision-language models (VLMs) have made significant strides in cross-modal understanding through paired datasets.
In the fashion domain, datasets often exhibit a disparity between the information conveyed in the image and the text.
We propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where the information co-occurs in both image and text.
arXiv Detail & Related papers (2024-04-01T15:01:38Z)
- Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Revisiting Self-Supervised Contrastive Learning for Facial Expression Recognition [39.647301516599505]
We revisit the use of self-supervised contrastive learning and explore three core strategies to enforce expression-specific representations.
Experimental results show that our proposed method outperforms the current state-of-the-art self-supervised learning methods.
arXiv Detail & Related papers (2022-10-08T00:04:27Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Unified Contrastive Learning in Image-Text-Label Space [130.31947133453406]
Unified Contrastive Learning (UniCL) is an effective way of learning semantically rich yet discriminative representations.
UniCL stand-alone is a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets.
arXiv Detail & Related papers (2022-04-07T17:34:51Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Self-supervised Contrastive Learning of Multi-view Facial Expressions [9.949781365631557]
Facial expression recognition (FER) has emerged as an important component of human-computer interaction systems.
We propose Contrastive Learning of Multi-view facial Expressions (CL-MEx) to exploit facial images captured simultaneously from different angles towards FER.
arXiv Detail & Related papers (2021-08-15T11:23:34Z)
- Heterogeneous Contrastive Learning: Encoding Spatial Information for Compact Visual Representations [183.03278932562438]
This paper presents an effective approach that adds spatial information to the encoding stage to alleviate the learning inconsistency between the contrastive objective and strong data augmentation operations.
We show that our approach achieves higher efficiency in visual representations and thus delivers a key message to inspire future research on self-supervised visual representation learning.
arXiv Detail & Related papers (2020-11-19T16:26:25Z)