Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup
Consistent Module
- URL: http://arxiv.org/abs/2209.02604v1
- Date: Mon, 22 Aug 2022 03:31:33 GMT
- Title: Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup
Consistent Module
- Authors: Yihe Liu, Ziqi Yuan, Huisheng Mao, Zhiyun Liang, Wanqiuyue Yang,
Yuanzhe Qiu, Tie Cheng, Xiaoteng Li, Hua Xu, Kai Gao
- Abstract summary: Multimodal sentiment analysis (MSA) is an emerging research area due to its potential applications in Human-Computer Interaction (HCI).
In this work, we emphasize making non-verbal cues matter for the MSA task.
- Score: 10.785594919904142
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal sentiment analysis (MSA), which aims to improve text-based
sentiment analysis with associated acoustic and visual modalities, is an
emerging research area due to its potential applications in Human-Computer
Interaction (HCI). However, existing research observes that the acoustic and
visual modalities contribute far less than the textual modality, a phenomenon
termed text-predominance. In this work, we therefore emphasize making
non-verbal cues matter for the MSA task. First, from the resource perspective,
we present the CH-SIMS v2.0 dataset, an extension and enhancement of CH-SIMS.
Compared with the original dataset, CH-SIMS v2.0 doubles its size with another
2121 refined video segments carrying both unimodal and multimodal annotations,
and collects 10161 unlabelled raw video segments with rich emotion-bearing
acoustic and visual context to highlight non-verbal cues for sentiment
prediction. Second, from the model perspective, we propose the Acoustic Visual
Mixup Consistent (AV-MC) framework, which benefits from the unimodal
annotations and the unsupervised data in CH-SIMS v2.0. Its modality mixup
module can be regarded as an augmentation that mixes the acoustic and visual
modalities from different videos. By pairing the text with multimodal contexts
that were never observed together, the model learns to be aware of different
non-verbal contexts for sentiment prediction. Our evaluations demonstrate that
both CH-SIMS v2.0 and the AV-MC framework enable further research into
discovering emotion-bearing acoustic and visual cues and pave the way to
interpretable end-to-end HCI applications in real-world scenarios.
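The modality mixup module can be illustrated with a minimal PyTorch-style sketch (an illustration under assumed shapes and names, not the authors' released code): acoustic and visual feature sequences from two different videos are interpolated with a Beta-sampled coefficient while the textual stream is left untouched, so the same utterance is paired with non-verbal contexts that never co-occurred with it. The function name av_mixup, the feature dimensions, and the soft-label mixing are illustrative assumptions.

```python
# Minimal sketch of acoustic-visual mixup as a batch-level augmentation.
# Assumptions (not from the paper): padded feature tensors acoustic (B, T_a, D_a)
# and visual (B, T_v, D_v), continuous sentiment labels in [-1, 1].
import torch

def av_mixup(acoustic, visual, labels, alpha=1.0):
    """Interpolate each clip's acoustic and visual streams with those of a
    randomly chosen partner clip; the text stream is left untouched."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(acoustic.size(0))                   # partner clips
    mixed_acoustic = lam * acoustic + (1.0 - lam) * acoustic[perm]
    mixed_visual = lam * visual + (1.0 - lam) * visual[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]  # soft targets
    return mixed_acoustic, mixed_visual, mixed_labels

# Toy usage with random tensors standing in for real feature sequences.
acoustic = torch.randn(8, 50, 74)   # 8 clips, 50 frames, 74-dim acoustic features
visual = torch.randn(8, 50, 35)     # 35-dim visual features
labels = torch.rand(8) * 2 - 1      # sentiment scores in [-1, 1]
mixed_a, mixed_v, mixed_y = av_mixup(acoustic, visual, labels)
```

Since the framework is named "Consistent", the mixed non-verbal streams are presumably combined with the original text and tied to a consistency-style training signal alongside the unimodal and multimodal annotations; the exact objective is defined in the paper itself.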
Related papers
- Semantic Matters: Multimodal Features for Affective Analysis [5.691287789660795]
We present our methodology for two tasks: the Emotional Mimicry Intensity (EMI) Estimation Challenge and the Behavioural Ambivalence/Hesitancy (BAH) Recognition Challenge.
We utilize a Wav2Vec 2.0 model pre-trained on a large podcast dataset to extract various audio features.
We integrate the textual and visual modalities into our analysis, recognizing that semantic content provides valuable contextual cues.
arXiv Detail & Related papers (2025-03-16T11:30:44Z)
- BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA [5.840467499436581]
We propose BioD2C: a novel Dual-level Semantic Consistency Constraint Framework for Biomedical VQA.
BioD2C achieves dual-level semantic interaction alignment at both the model and feature levels, enabling the model to adaptively learn visual features based on the question.
In this work, we establish a new dataset, BioVGQ, to address inherent biases in prior datasets by filtering manually-altered images and aligning question-answer pairs with multimodal context.
arXiv Detail & Related papers (2025-03-04T10:39:42Z)
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.
Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation.
We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism.
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
- Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge [60.616313552585645]
We present models for effective Ambiguity Detection and Coreference Resolution in Conversational AI.
Specifically, we use TOD-BERT and LXMERT based models, compare them to a number of baselines and provide ablation experiments.
Our results show that (1) language models are able to exploit correlations in the data to detect ambiguity; and (2) unimodal coreference resolution models can avoid the need for a vision component.
arXiv Detail & Related papers (2022-02-25T12:10:02Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.