Emotion Recognition based on Multi-Task Learning Framework in the ABAW4
Challenge
- URL: http://arxiv.org/abs/2207.09373v1
- Date: Tue, 19 Jul 2022 16:18:53 GMT
- Title: Emotion Recognition based on Multi-Task Learning Framework in the ABAW4
Challenge
- Authors: Tenggan Zhang, Chuanhe Liu, Xiaolong Liu, Yuchen Liu, Liyu Meng, Lei
Sun, Wenqiang Jiang, Fengyuan Zhang
- Abstract summary: This paper presents our submission to the Multi-Task Learning (MTL) Challenge of the 4th Affective Behavior Analysis in-the-wild (ABAW) competition.
Based on visual feature representations, we utilize three types of temporal encoders to capture the temporal context information in the video.
Our system achieves a performance of $1.742$ on the MTL Challenge validation dataset.
- Score: 12.662242704351563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our submission to the Multi-Task Learning (MTL) Challenge
of the 4th Affective Behavior Analysis in-the-wild (ABAW) competition. Based on
visual feature representations, we utilize three types of temporal encoders to
capture the temporal context information in the video: a transformer-based
encoder, an LSTM-based encoder, and a GRU-based encoder. With the temporal
context-aware representations, we employ a multi-task framework to predict the
valence, arousal, expression, and AU values of the images. In addition,
smoothing is applied to refine the initial valence and arousal predictions, and
a model ensemble strategy is used to combine results from different model
setups. Our system achieves a performance of $1.742$ on the MTL Challenge
validation dataset.
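Below is a minimal PyTorch sketch of the kind of model the abstract describes: per-frame visual features pass through one of three temporal encoders (transformer, LSTM, or GRU), and the context-aware representation feeds three task-specific heads. All names and hyperparameters (512-dim features, 8 expression classes, 12 AUs, layer sizes) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TemporalMTL(nn.Module):
    """Temporal encoder (transformer / LSTM / GRU) with three task heads."""
    def __init__(self, feat_dim=512, hidden=256, encoder_type="gru"):
        super().__init__()
        if encoder_type == "gru":
            self.encoder = nn.GRU(feat_dim, hidden, num_layers=2,
                                  batch_first=True, bidirectional=True)
        elif encoder_type == "lstm":
            self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                                   batch_first=True, bidirectional=True)
        else:  # transformer encoder over the frame sequence
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        out_dim = feat_dim if encoder_type not in ("gru", "lstm") else 2 * hidden
        # One head per task: valence/arousal regression in [-1, 1],
        # expression classification, and multi-label AU detection.
        self.va_head = nn.Sequential(nn.Linear(out_dim, 2), nn.Tanh())
        self.expr_head = nn.Linear(out_dim, 8)   # assumed 8 expression classes
        self.au_head = nn.Linear(out_dim, 12)    # assumed 12 action units (logits)

    def forward(self, feats):  # feats: (batch, seq_len, feat_dim) visual features
        if isinstance(self.encoder, nn.TransformerEncoder):
            ctx = self.encoder(feats)
        else:
            ctx, _ = self.encoder(feats)  # recurrent encoders return (output, state)
        return self.va_head(ctx), self.expr_head(ctx), self.au_head(ctx)

# Example: per-frame predictions for a batch of 4 clips of 64 frames each.
model = TemporalMTL(encoder_type="transformer")
va, expr_logits, au_logits = model(torch.randn(4, 64, 512))
```

At inference, the per-frame valence/arousal outputs could then be smoothed (e.g., with a moving average) and combined across model setups, mirroring the post-processing mentioned in the abstract; the exact smoothing window and ensemble weighting are not specified there.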
Related papers
- MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images.
In this paper, we propose a two-stage framework to tackle both the modality and task discrepancies.
MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection [66.72992463712299]
Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training models.
Previous research has demonstrated the efficacy of ViTs, but they still struggle with computational inefficiencies caused by lengthy visual sequences.
We introduce TRIPS, which reduces the visual sequence using a text-guided patch-selection layer in the visual backbone.
Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
arXiv Detail & Related papers (2024-01-11T14:31:30Z)
- MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer [12.544216587327387]
We present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video.
The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and employs an optional input beyond video.
We present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions.
arXiv Detail & Related papers (2023-04-12T15:50:19Z)
- Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [26.933683814025475]
We introduce two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K).
These datasets incorporate both visual and text-based inputs and outputs.
To facilitate the accountability of multimodal systems in rejecting human requests, similar to language-based ChatGPT conversations, we introduce specific rules as supervisory signals within the datasets.
arXiv Detail & Related papers (2023-03-10T15:35:11Z)
- An Ensemble Approach for Multiple Emotion Descriptors Estimation Using Multi-task Learning [12.589338141771385]
This paper describes our submission to the fourth Affective Behavior Analysis in-the-Wild (ABAW) Competition.
Instead of using only face information, we employ the full information from the provided dataset, containing both the face and the context around it.
The proposed system achieves a performance of 0.917 on the MTL Challenge validation dataset.
arXiv Detail & Related papers (2022-07-22T04:57:56Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- IMAGINE: Image Synthesis by Image-Guided Model Inversion [79.4691654458141]
We introduce an inversion-based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images.
We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations.
IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process.
arXiv Detail & Related papers (2021-04-13T02:00:24Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)