Audio-Visual Compound Expression Recognition Method based on Late Modality Fusion and Rule-based Decision
- URL: http://arxiv.org/abs/2403.12687v2
- Date: Fri, 29 Mar 2024 12:45:27 GMT
- Title: Audio-Visual Compound Expression Recognition Method based on Late Modality Fusion and Rule-based Decision
- Authors: Elena Ryumina, Maxim Markitantov, Dmitry Ryumin, Heysem Kaya, Alexey Karpov
- Abstract summary: This paper presents the results of the SUN team for the Compound Expressions Recognition Challenge of the 6th ABAW Competition.
We propose a novel audio-visual method for compound expression recognition.
- Score: 9.436107335675473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the results of the SUN team for the Compound Expressions Recognition Challenge of the 6th ABAW Competition. We propose a novel audio-visual method for compound expression recognition. Our method relies on emotion recognition models that fuse modalities at the emotion probability level, while decisions regarding the prediction of compound expressions are based on predefined rules. Notably, our method does not use any training data specific to the target task; the problem is therefore a zero-shot classification task. The method is evaluated in multi-corpus training and cross-corpus validation setups. The proposed method achieves an F1-score of 22.01% on the C-EXPR-DB test subset. Our findings from the challenge demonstrate that the proposed method can potentially form a basis for developing intelligent tools for annotating audio-visual data in the context of basic and compound human emotions.
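The paper does not ship code with this listing; as an illustration only, a minimal sketch of probability-level late fusion followed by a rule-based compound decision might look like the following. The fusion weights and the top-2 decision rule are hypothetical placeholders, not the authors' exact configuration; the compound labels are the challenge's seven compound classes.

```python
import numpy as np

EMOTIONS = ["neutral", "anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Compound classes of the challenge, keyed by alphabetically sorted basic-emotion pairs.
COMPOUND_RULES = {
    ("fear", "surprise"): "Fearfully Surprised",
    ("happiness", "surprise"): "Happily Surprised",
    ("sadness", "surprise"): "Sadly Surprised",
    ("disgust", "surprise"): "Disgustedly Surprised",
    ("anger", "surprise"): "Angrily Surprised",
    ("fear", "sadness"): "Sadly Fearful",
    ("anger", "sadness"): "Sadly Angry",
}

def fuse(p_audio, p_video, w_audio=0.4, w_video=0.6):
    """Late fusion at the emotion-probability level (weights are assumptions)."""
    fused = w_audio * np.asarray(p_audio) + w_video * np.asarray(p_video)
    return fused / fused.sum()

def compound_decision(fused):
    """Rule-based decision: map the two most probable basic emotions to a compound class.

    Returns None when the top pair forms no compound under the rule table."""
    top2 = np.argsort(fused)[-2:]
    pair = tuple(sorted(EMOTIONS[i] for i in top2))
    return COMPOUND_RULES.get(pair)
```

For example, `compound_decision(fuse(p_audio, p_video))` with audio and video probability vectors whose fused mass concentrates on fear and surprise would return "Fearfully Surprised".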
Related papers
- OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
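OV-DINO's actual architecture is considerably more involved; purely to illustrate the open-vocabulary idea it builds on (scoring region features against text embeddings of arbitrary class names), a toy sketch under assumed pre-extracted embeddings:

```python
import numpy as np

def open_vocab_classify(region_feats, text_embeds, class_names, temperature=0.07):
    """Assign each detected region the class name with the most similar text embedding.

    region_feats: (R, D) visual region embeddings; text_embeds: (C, D) embeddings
    of class_names. The temperature value is an illustrative assumption."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = r @ t.T / temperature                      # temperature-scaled cosine similarity
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return [class_names[i] for i in probs.argmax(axis=1)], probs
```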
arXiv Detail & Related papers (2024-07-10T17:05:49Z)
- Compound Expression Recognition via Multi Model Ensemble [8.529105068848828]
Compound Expression Recognition plays a crucial role in interpersonal interactions.
We propose a solution based on ensemble learning methods for Compound Expression Recognition.
Our method achieves high accuracy on RAF-DB and recognizes expressions zero-shot on certain portions of C-EXPR-DB.
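The abstract does not specify the ensembling scheme; a minimal sketch of one common choice, weighted averaging of per-model class probabilities, is shown below (the weights are assumptions):

```python
import numpy as np

def ensemble_predict(prob_list, weights=None):
    """Weighted average of per-model probability matrices, each (N samples, C classes)."""
    probs = np.stack(prob_list)                          # (M models, N, C)
    w = np.ones(len(prob_list)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    fused = np.tensordot(w, probs, axes=1)               # (N, C) fused probabilities
    return fused.argmax(axis=1), fused
```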
arXiv Detail & Related papers (2024-03-19T09:30:56Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
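A consistency claim of this kind can be quantified with a rank correlation between the two sets of ratings; a minimal sketch with made-up scores follows (AIR-Bench's actual protocol may differ):

```python
from scipy.stats import spearmanr

# Hypothetical ratings of the same model answers on a 1-10 scale.
gpt4_scores  = [8, 3, 9, 5, 7, 2, 6, 8]
human_scores = [7, 4, 9, 5, 8, 2, 5, 8]

rho, p = spearmanr(gpt4_scores, human_scores)
print(f"Spearman rho={rho:.3f} (p={p:.3g})")  # high rho -> consistent rankings
```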
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the task from another perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
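As a rough illustration of what "discrete and continuous noise" could mean in practice, the sketch below corrupts a sentence at the token level and at the embedding level; the noise rates and mask token are assumptions, not the paper's settings:

```python
import random
import torch

def discrete_noise(tokens, p_del=0.1, p_mask=0.1, mask_token="[MASK]"):
    """Token-level (discrete) corruption: random deletion and masking."""
    out = []
    for tok in tokens:
        r = random.random()
        if r < p_del:
            continue                                   # delete this token
        out.append(mask_token if r < p_del + p_mask else tok)
    return out

def continuous_noise(embeddings, sigma=0.1):
    """Embedding-level (continuous) corruption: additive Gaussian noise."""
    return embeddings + sigma * torch.randn_like(embeddings)
```

The model is then trained to restore the original sentence from such corrupted inputs.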
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- A Change of Heart: Improving Speech Emotion Recognition through Speech-to-Text Modality Conversion [0.6767885381740951]
We introduce a modality conversion concept aimed at enhancing emotion recognition performance on the MELD dataset.
We assess our approach through two experiments: the first, a method named Modality-Conversion, employs automatic speech recognition (ASR) systems followed by a text classifier.
Our findings indicate that the first method yields substantial results, while the second outperforms state-of-the-art (SOTA) speech-based approaches in terms of SER weighted-F1 (WF1) score on the MELD dataset.
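The modality-conversion pipeline is straightforward to sketch; the checkpoints below are illustrative stand-ins from the Hugging Face Hub, not the systems used in the paper:

```python
from transformers import pipeline

# Stand-in models: Whisper for ASR, a generic text emotion classifier downstream.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
emotion_clf = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base")

def speech_emotion(path_to_wav):
    """Modality conversion: speech -> text (ASR), then text -> emotion label."""
    transcript = asr(path_to_wav)["text"]
    return emotion_clf(transcript)[0]     # e.g. {'label': 'joy', 'score': 0.93}
```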
arXiv Detail & Related papers (2023-07-21T13:48:11Z)
- Multi-modal Expression Recognition with Ensemble Method [9.880739481276835]
Multimodal feature combinations extracted by several different pre-trained models are applied to capture more effective emotional information.
For these combinations of visual and audio modal features, we utilize two temporal encoders to explore the temporal contextual information in the data.
Our system achieves the average F1 Score of 0.45774 on the validation set.
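A minimal sketch of the two-temporal-encoder design the summary describes, using GRUs as the encoders (the encoder type and all dimensions are assumptions):

```python
import torch
import torch.nn as nn

class AVTemporalFusion(nn.Module):
    """Two temporal encoders (one per modality) whose final states are fused
    for expression classification; dimensions are illustrative assumptions."""
    def __init__(self, vis_dim=512, aud_dim=256, hidden=128, n_classes=8):
        super().__init__()
        self.vis_enc = nn.GRU(vis_dim, hidden, batch_first=True)
        self.aud_enc = nn.GRU(aud_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, vis_seq, aud_seq):   # (B, T, vis_dim), (B, T, aud_dim)
        _, h_v = self.vis_enc(vis_seq)
        _, h_a = self.aud_enc(aud_seq)
        fused = torch.cat([h_v[-1], h_a[-1]], dim=-1)
        return self.head(fused)
```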
arXiv Detail & Related papers (2023-03-17T15:03:58Z)
- Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
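One common way to steer a pretrained diffusion model at inference time is gradient guidance from a speaker-similarity loss inside the denoising loop; the sketch below illustrates that general pattern and is a hypothetical stand-in, not necessarily this paper's mechanism:

```python
import torch
import torch.nn.functional as F

def guided_denoise_step(x_t, eps_pred, speaker_encoder, ref_embed, scale=0.5):
    """One denoising update nudged toward the reference voice (illustrative only).

    ref_embed is the speaker embedding of the short target-person sample;
    the final line uses a placeholder update rule, not a real DDPM schedule."""
    x_t = x_t.detach().requires_grad_(True)
    # Penalize dissimilarity between the current sample's voice and the reference.
    loss = 1 - F.cosine_similarity(speaker_encoder(x_t), ref_embed, dim=-1).mean()
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - eps_pred).detach() - scale * grad
```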
arXiv Detail & Related papers (2022-06-05T19:45:29Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols into the synthesis model.
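A toy sketch of the recognize-then-resynthesize pipeline follows; real systems use a trained discrete-unit recognizer and a neural vocoder, for which the modules below are crude stand-ins:

```python
import torch
import torch.nn as nn

class DiscretizeAndResynthesize(nn.Module):
    """Separate by recognizing per-speaker discrete symbols, then re-synthesize."""
    def __init__(self, n_units=256, feat_dim=80, n_speakers=2):
        super().__init__()
        self.recognizer = nn.GRU(feat_dim, 128, batch_first=True)
        self.unit_heads = nn.ModuleList(
            [nn.Linear(128, n_units) for _ in range(n_speakers)])  # per-speaker symbols
        self.synthesizer = nn.Embedding(n_units, feat_dim)          # stands in for a vocoder

    def forward(self, mixture):                        # (B, T, feat_dim) mixed features
        h, _ = self.recognizer(mixture)
        outputs = []
        for head in self.unit_heads:
            units = head(h).argmax(dim=-1)             # (B, T) discrete symbol sequence
            outputs.append(self.synthesizer(units))    # re-synthesized target features
        return outputs
```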
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
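The rank-correlation-to-2D idea can be sketched directly: compute pairwise Spearman correlations between classifiers' score vectors, convert them to dissimilarities, and embed with MDS (the exact embedding method used in the paper may differ):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.manifold import MDS

def classifier_adjacency_2d(score_matrix):
    """score_matrix: (n_classifiers, n_trials) detection scores.

    Returns (n_classifiers, 2) coordinates derived from rank-correlation distances."""
    ranks = np.apply_along_axis(rankdata, 1, score_matrix)
    rho = np.corrcoef(ranks)               # Spearman correlation = Pearson on ranks
    dist = 1.0 - rho                       # turn similarity into dissimilarity
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(dist)
```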
arXiv Detail & Related papers (2021-06-11T13:03:33Z)
- Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis [8.391631335854457]
We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model, a.k.a. BERT, with explicit features extracted from a BiLSTM over linguistic features.
The proposed method takes both representations into account to extract latent semantics that cannot be captured by previous methods.
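A minimal sketch of combining the two feature streams for per-token break prediction; all dimensions are assumptions and the BERT embeddings are taken as pre-extracted inputs:

```python
import torch
import torch.nn as nn

class PhraseBreakPredictor(nn.Module):
    """Fuse implicit features (BERT token embeddings) with explicit linguistic
    features encoded by a BiLSTM; predict break / no-break per token."""
    def __init__(self, bert_dim=768, ling_dim=32, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(ling_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(bert_dim + 2 * hidden, 2)

    def forward(self, bert_feats, ling_feats):  # (B, T, bert_dim), (B, T, ling_dim)
        explicit, _ = self.bilstm(ling_feats)
        return self.head(torch.cat([bert_feats, explicit], dim=-1))
```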
arXiv Detail & Related papers (2021-04-26T08:29:29Z)
- Expression Recognition Analysis in the Wild [9.878384185493623]
We report details and experimental results about a facial expression recognition method based on state-of-the-art methods.
We fine-tuned an SENet deep learning architecture pre-trained on the well-known VGGFace2 dataset.
This paper is also required by the Affective Behavior Analysis in-the-wild (ABAW) competition in order to evaluate the approach on the test set.
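The fine-tuning recipe is standard transfer learning; since VGGFace2-pretrained SENet weights ship separately from mainstream libraries, the sketch below uses a torchvision ResNet as a stand-in backbone to illustrate the recipe only:

```python
import torch.nn as nn
from torchvision import models

# Stand-in for the paper's VGGFace2-pretrained SENet: an ImageNet-pretrained ResNet-50.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Linear(backbone.fc.in_features, 7)   # e.g. 7 basic expressions

# Freeze early layers and fine-tune the last block plus the new head
# (a common transfer-learning choice, not necessarily the paper's).
for name, p in backbone.named_parameters():
    if not (name.startswith("layer4") or name.startswith("fc")):
        p.requires_grad = False
```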
arXiv Detail & Related papers (2021-01-22T17:28:31Z)