Flexible-modal Deception Detection with Audio-Visual Adapter
- URL: http://arxiv.org/abs/2302.05727v1
- Date: Sat, 11 Feb 2023 15:47:20 GMT
- Title: Flexible-modal Deception Detection with Audio-Visual Adapter
- Authors: Zhaoxu Li, Zitong Yu, Nithish Muthuchamy Selvaraj, Xiaobao Guo,
Bingquan Shen, Adams Wai-Kin Kong, Alex Kot
- Abstract summary: We propose a novel framework to fuse temporal features across two modalities efficiently.
Experiments conducted on two benchmark datasets demonstrate that the proposed method can achieve superior performance.
- Score: 20.6514221670249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting deception from human behaviors is vital in many fields such as
customs security and multimedia anti-fraud. Recently, audio-visual deception detection
has attracted more attention because it performs better than approaches that use only a
single modality. However, in real-world multi-modal settings, data integrity can be an
issue (e.g., sometimes only partial modalities are available). A missing modality might
lead to a decrease in performance, but the model still learns the features of the
missing modality. In this paper, to further improve performance and overcome the
missing-modality problem, we propose a novel Transformer-based framework with an
Audio-Visual Adapter (AVA) that efficiently fuses temporal features across the two
modalities. Extensive experiments conducted on two benchmark datasets demonstrate that
the proposed method achieves superior performance compared with other multi-modal
fusion methods under flexible-modal (multiple and missing modalities) settings.
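The abstract does not include implementation details, so the following is only a minimal sketch of how a bottleneck-style audio-visual adapter might fuse temporal features from the two streams; the module names, dimensions, and the cross-attention fusion rule are illustrative assumptions, not the authors' actual AVA design.

```python
# Minimal sketch (PyTorch) of a bottleneck audio-visual adapter; names and the
# exact fusion rule are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class AudioVisualAdapter(nn.Module):
    """Lightweight adapter that mixes temporal audio and visual tokens."""
    def __init__(self, dim: int = 256, bottleneck: int = 64):
        super().__init__()
        self.down_a = nn.Linear(dim, bottleneck)   # audio -> bottleneck
        self.down_v = nn.Linear(dim, bottleneck)   # visual -> bottleneck
        self.cross = nn.MultiheadAttention(bottleneck, num_heads=4,
                                           batch_first=True)
        self.up_a = nn.Linear(bottleneck, dim)
        self.up_v = nn.Linear(bottleneck, dim)

    def forward(self, audio, visual):
        # audio:  (B, Ta, dim) temporal audio tokens
        # visual: (B, Tv, dim) temporal visual tokens
        a, v = self.down_a(audio), self.down_v(visual)
        # each stream attends to the other inside the bottleneck
        a2, _ = self.cross(a, v, v)
        v2, _ = self.cross(v, a, a)
        # residual connections keep the backbone features intact
        return audio + self.up_a(a2), visual + self.up_v(v2)

if __name__ == "__main__":
    ava = AudioVisualAdapter()
    a = torch.randn(2, 50, 256)   # e.g. 50 audio frames
    v = torch.randn(2, 30, 256)   # e.g. 30 video frames
    fa, fv = ava(a, v)
    print(fa.shape, fv.shape)     # (2, 50, 256) and (2, 30, 256)
```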
Related papers
- Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach [29.428067329993173]
We propose a modality invariant multimodal learning method, which is less susceptible to the impact of missing modalities.
It consists of a single-branch network that shares weights across multiple modalities to learn inter-modality representations and maximize performance.
Compared with existing state-of-the-art methods, the proposed method achieves superior performance both when all modalities are present and when modalities are missing during training or testing.
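As a rough illustration of the weight-sharing idea, the sketch below routes every available modality through one shared encoder after a per-modality projection; the projection sizes, encoder configuration, and modality names are assumptions, not the paper's architecture.

```python
# Hedged sketch of a single-branch encoder shared across modalities.
import torch
import torch.nn as nn

class SingleBranchModel(nn.Module):
    def __init__(self, dim=256, audio_dim=128, visual_dim=512, num_classes=2):
        super().__init__()
        # modality-specific projections into a common token space
        self.proj = nn.ModuleDict({
            "audio": nn.Linear(audio_dim, dim),
            "visual": nn.Linear(visual_dim, dim),
        })
        # one branch (shared weights) encodes whichever modalities are present
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, inputs: dict):
        # inputs maps modality name -> (B, T, feat_dim); missing keys are skipped
        tokens = [self.proj[name](x) for name, x in inputs.items()]
        x = torch.cat(tokens, dim=1)          # concatenate along the token axis
        x = self.encoder(x).mean(dim=1)       # pooled joint representation
        return self.head(x)

model = SingleBranchModel()
full = model({"audio": torch.randn(2, 40, 128), "visual": torch.randn(2, 16, 512)})
audio_only = model({"audio": torch.randn(2, 40, 128)})   # visual modality missing
print(full.shape, audio_only.shape)
```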
arXiv Detail & Related papers (2024-08-14T10:32:16Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
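The dropout technique mentioned above can be pictured as randomly blanking the video stream during training; the sketch below shows one such per-sample drop, with the probability and zero-fill choice being assumptions rather than the paper's MDA-KD framework.

```python
# Illustrative sketch of modality dropout applied to the video stream.
import torch

def drop_video_modality(video: torch.Tensor, p: float = 0.3,
                        training: bool = True) -> torch.Tensor:
    """Randomly zero out the whole video stream per sample with probability p."""
    if not training or p <= 0.0:
        return video
    batch = video.shape[0]
    keep = (torch.rand(batch, device=video.device) > p).float()
    # broadcast the per-sample keep mask over the time and feature dimensions
    return video * keep.view(batch, *([1] * (video.dim() - 1)))

video = torch.randn(4, 25, 512)                           # (batch, frames, features)
print(drop_video_modality(video).abs().sum(dim=(1, 2)))   # some rows become 0
```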
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Fourier Prompt Tuning for Modality-Incomplete Scene Segmentation [37.06795681738417]
Modality-Incomplete Scene Segmentation (MISS) is a task that encompasses both system-level modality absence and sensor-level modality errors.
We introduce a Missing-aware Modal Switch (MMS) strategy to proactively manage missing modalities during training.
We show an improvement of 5.84% mIoU over prior state-of-the-art parameter-efficient methods under missing-modality conditions.
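Proactively managing missing modalities during training can be pictured as randomly switching which sensor streams are kept for each iteration; the modality names and the uniform schedule below are assumptions, not the paper's MMS strategy.

```python
# Rough sketch of simulating missing modalities during training.
import random

MODALITIES = ["rgb", "depth", "event", "lidar"]   # assumed modality names

def sample_modal_switch(min_present: int = 1) -> dict:
    """Pick a random subset of modalities to keep for the next batch."""
    k = random.randint(min_present, len(MODALITIES))
    present = set(random.sample(MODALITIES, k))
    return {m: (m in present) for m in MODALITIES}

for step in range(3):
    mask = sample_modal_switch()
    print(f"step {step}: use only", [m for m, ok in mask.items() if ok])
```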
arXiv Detail & Related papers (2024-01-30T11:46:27Z)
- Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent.
Our method mitigates the performance loss, reducing it from the original ~30% drop to only ~10% when half of the test set is modal-incomplete.
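A missing-modality token can be read as a learnable placeholder that stands in for an absent modality's features; the sketch below shows one minimal way to do this, with the dimensions and substitution rule being illustrative assumptions rather than the paper's MMT.

```python
# Minimal sketch of a learnable "missing modality token".
import torch
import torch.nn as nn

class MissingModalityTokens(nn.Module):
    def __init__(self, modalities=("audio", "video"), dim: int = 256):
        super().__init__()
        # one learnable token per modality, trained jointly with the model
        self.tokens = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1, 1, dim)) for m in modalities})

    def fill(self, name: str, feats, batch: int, length: int):
        """Return real features if present, otherwise the expanded learned token."""
        if feats is not None:
            return feats
        return self.tokens[name].expand(batch, length, -1)

mmt = MissingModalityTokens()
video = torch.randn(2, 16, 256)
audio = mmt.fill("audio", None, batch=2, length=16)   # audio stream is missing
fused = torch.cat([audio, video], dim=1)              # feed fused tokens downstream
print(fused.shape)   # torch.Size([2, 32, 256])
```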
arXiv Detail & Related papers (2024-01-21T11:55:42Z)
- What Makes for Robust Multi-Modal Models in the Face of Missing Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA).
UME-MMA employs uni-modal pre-trained weights in the multi-modal model to enhance feature extraction, and uses missing-modality data augmentation to better adapt to situations where modalities are missing.
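The sketch below illustrates only the first ingredient, initializing a two-stream model from separately pre-trained uni-modal encoders; the module layout and checkpoint paths are hypothetical, and the augmentation step is the kind of per-sample stream blanking sketched earlier.

```python
# Hedged sketch of initializing a multi-modal model from uni-modal weights.
import torch
import torch.nn as nn

class TwoStreamModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.audio_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.visual_encoder = nn.Sequential(nn.Linear(512, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.classifier = nn.Linear(2 * dim, 2)

    def forward(self, audio, visual):
        a = self.audio_encoder(audio).mean(dim=1)    # (B, dim) pooled audio
        v = self.visual_encoder(visual).mean(dim=1)  # (B, dim) pooled visual
        return self.classifier(torch.cat([a, v], dim=-1))

model = TwoStreamModel()
# Load weights from separately pre-trained uni-modal encoders (paths hypothetical):
# model.audio_encoder.load_state_dict(torch.load("audio_only_encoder.pt"))
# model.visual_encoder.load_state_dict(torch.load("visual_only_encoder.pt"))
out = model(torch.randn(2, 40, 128), torch.randn(2, 16, 512))
print(out.shape)   # torch.Size([2, 2])
```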
arXiv Detail & Related papers (2023-10-10T07:47:57Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
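Learnable queries that aggregate global context within a modality can be sketched as a small cross-attention block; the query count, dimensions, and naming below are assumptions and not the paper's IMQ implementation.

```python
# Rough sketch of learnable queries aggregating per-modality context.
import torch
import torch.nn as nn

class ContextQueries(nn.Module):
    def __init__(self, num_queries: int = 8, dim: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, modality_tokens: torch.Tensor) -> torch.Tensor:
        # modality_tokens: (B, T, dim) tokens from one modality (image or text)
        q = self.queries.expand(modality_tokens.shape[0], -1, -1)
        ctx, _ = self.attn(q, modality_tokens, modality_tokens)
        return ctx    # (B, num_queries, dim) aggregated contextual cues

img_ctx = ContextQueries()(torch.randn(2, 196, 256))   # e.g. ViT patch tokens
print(img_ctx.shape)   # torch.Size([2, 8, 256])
```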
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenario with the help of available multi-modal data.
Experiments demonstrate that the single model trained based on FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z)
- MA-ViT: Modality-Agnostic Vision Transformers for Face Anti-Spoofing [3.3031006227198003]
We present Modality-Agnostic Vision Transformer (MA-ViT), which aims to improve the performance of arbitrary modal attacks with the help of multi-modal data.
Specifically, MA-ViT adopts early fusion to aggregate all available training modality data and enables flexible testing on any given modal samples.
Experiments demonstrate that the single model trained on MA-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
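Early fusion here can be pictured as concatenating whatever modal tokens are available before one shared transformer, so a single model accepts any subset of modalities at test time; the layer sizes and classification-from-CLS choice below are illustrative assumptions, not MA-ViT itself.

```python
# Hedged sketch of early fusion over a variable set of modalities.
import torch
import torch.nn as nn

class EarlyFusionViT(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, *modal_tokens):
        # each argument: (B, T_m, dim) patch tokens from one modality (RGB, IR, depth, ...)
        b = modal_tokens[0].shape[0]
        x = torch.cat([self.cls.expand(b, -1, -1), *modal_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])   # classify from the shared CLS token

model = EarlyFusionViT()
rgb, depth = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(model(rgb, depth).shape, model(rgb).shape)  # works with any subset of modalities
```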
arXiv Detail & Related papers (2023-04-15T13:03:44Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
- Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.