DeepSafety: Multi-level Audio-Text Feature Extraction and Fusion Approach for Violence Detection in Conversations
- URL: http://arxiv.org/abs/2206.11822v1
- Date: Thu, 23 Jun 2022 16:45:50 GMT
- Title: DeepSafety: Multi-level Audio-Text Feature Extraction and Fusion Approach for Violence Detection in Conversations
- Authors: Amna Anwar, Eiman Kanjo, Dario Ortega Anderez
- Abstract summary: The choice of words and vocal cues in conversations presents an underexplored rich source of natural language data for personal safety and crime prevention.
We introduce a new information fusion approach that extracts and fuses multi-level verbal, vocal, and textual features as heterogeneous sources of information to detect the extent of violent behaviours in conversations.
- Score: 2.8038382295783943
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Natural Language Processing has recently made understanding human interaction
easier, leading to improved sentiment analysis and behaviour prediction.
However, the choice of words and vocal cues in conversations presents an
underexplored rich source of natural language data for personal safety and
crime prevention. When accompanied by audio analysis, such data makes it possible to
understand the context of a conversation, including the level of tension or
rift between people. Building on existing work, we introduce a new information
fusion approach that extracts and fuses multi-level verbal, vocal, and textual
features as heterogeneous sources of information to detect the extent of violent
behaviours in conversations. Our multi-level multimodal fusion framework integrates
four types of information derived from the raw audio signal: embeddings generated
from BERT and bidirectional long short-term memory (Bi-LSTM) models, the output of
a 2D CNN applied to Mel-frequency cepstral coefficients (MFCC), and the output of
a dense layer applied to the time-domain audio. The embeddings are then passed to
three-layer fully connected (FC) networks, which serve as a concatenation step.
Our experiments revealed that the combination of multi-level features from
different modalities achieves better performance than using any single modality,
with an F1 score of 0.85. We expect the findings derived from our method to
provide new approaches for violence detection in conversations.
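A minimal PyTorch sketch of the fusion head described above is given below, assuming the four branch outputs (a BERT text embedding, a Bi-LSTM embedding, pooled 2D-CNN features over MFCC, and a time-domain dense-layer output) are already computed; the layer sizes, the `FusionClassifier` name, and the three output classes are illustrative placeholders, not the authors' configuration.

```python
# Hypothetical sketch of the multi-level fusion head described in the abstract.
# Branch dimensions and layer sizes are assumptions, not the paper's values.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, bert_dim=768, lstm_dim=256, mfcc_cnn_dim=128,
                 time_dim=64, hidden=256, num_classes=3):
        super().__init__()
        # One small projection per modality-level feature.
        self.proj_bert = nn.Linear(bert_dim, hidden)
        self.proj_lstm = nn.Linear(lstm_dim, hidden)
        self.proj_mfcc = nn.Linear(mfcc_cnn_dim, hidden)
        self.proj_time = nn.Linear(time_dim, hidden)
        # Three-layer FC network over the concatenated embeddings.
        self.fc = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, num_classes),
        )

    def forward(self, bert_emb, lstm_emb, mfcc_feat, time_feat):
        fused = torch.cat([
            self.proj_bert(bert_emb),
            self.proj_lstm(lstm_emb),
            self.proj_mfcc(mfcc_feat),
            self.proj_time(time_feat),
        ], dim=-1)
        return self.fc(fused)  # logits over violence-severity classes

# Example with random branch outputs for a batch of 2 utterances.
model = FusionClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 256),
               torch.randn(2, 128), torch.randn(2, 64))
print(logits.shape)  # torch.Size([2, 3])
```

In this sketch the FC stage plays the role the abstract assigns to it: it takes the concatenated multi-level embeddings and maps them to class scores.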
Related papers
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts [8.809586885539002]
We propose a novel approach utilizing audio-visual multimodal data.
This method enhances audio feature extraction by leveraging Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features alongside a pre-trained VGGish network.
Our method notably improves the accuracy of AU detection by understanding the temporal and contextual nuances of the data, showcasing significant advancements in the comprehension of intricate scenarios.
arXiv Detail & Related papers (2024-03-20T15:37:19Z)
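As background to the MFCC and log-Mel features mentioned in the AUD-TGN entry above, here is a minimal librosa sketch of extracting both; the frame parameters, feature counts, and the `speech.wav` filename are arbitrary assumptions rather than the paper's settings.

```python
# Minimal sketch: MFCC and log-Mel spectrogram extraction with librosa.
# Parameter values (n_mfcc, n_mels, hop_length) are illustrative only.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)  # hypothetical input file

# 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)

# Log-scaled Mel spectrogram (power converted to dB).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, hop_length=160)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(mfcc.shape, log_mel.shape)  # (13, n_frames), (64, n_frames)
```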
- Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction [6.1058750788332325]
We introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild.
Our methodology utilises the Wav2Vec 2.0 architecture, which has been pre-trained on an extensive podcast dataset.
We refine our feature extraction process by employing a fusion technique that combines individual features with a global mean vector.
arXiv Detail & Related papers (2024-03-18T15:32:02Z)
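A hedged sketch of the kind of pipeline the EMI entry above outlines: frame-level Wav2Vec 2.0 features combined with a global mean vector. The `facebook/wav2vec2-base` checkpoint stands in for the podcast-pretrained model, and the concatenation fusion is an assumption, not the authors' exact method.

```python
# Sketch: Wav2Vec 2.0 frame features fused with a global mean vector.
# Checkpoint choice and concatenation fusion are assumptions for illustration.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = torch.randn(16000)  # 1 s of dummy audio at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    frames = model(**inputs).last_hidden_state      # (1, T, 768) frame features
global_mean = frames.mean(dim=1, keepdim=True)      # (1, 1, 768) utterance summary
fused = torch.cat([frames, global_mean.expand_as(frames)], dim=-1)  # (1, T, 1536)
print(fused.shape)
```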
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
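To illustrate the general idea of an audio-guided cross-modal fusion layer as in the AVSR entry above, here is a small sketch in which audio frames query visual (lip) features through standard multi-head attention; the dimensions, the residual design, and the `AudioGuidedFusion` name are assumptions rather than the CMFE architecture itself.

```python
# Sketch of one audio-guided cross-modal attention layer (illustrative only).
import torch
import torch.nn as nn

class AudioGuidedFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # Audio frames query the visual (lip) features.
        fused, _ = self.attn(query=audio_feats, key=visual_feats, value=visual_feats)
        return self.norm(audio_feats + fused)  # residual + layer norm

layer = AudioGuidedFusion()
audio = torch.randn(2, 100, 256)   # (batch, audio frames, dim)
visual = torch.randn(2, 25, 256)   # (batch, video frames, dim)
print(layer(audio, visual).shape)  # torch.Size([2, 100, 256])
```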
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
However, they often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Multistage linguistic conditioning of convolutional layers for speech emotion recognition [7.482371204083917]
We investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER).
We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN).
Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline.
arXiv Detail & Related papers (2021-10-13T11:28:04Z)
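A rough sketch of the multistage-fusion idea from the entry above: an utterance-level text embedding is injected at more than one convolutional stage of the audio branch instead of only at a late fusion step. The broadcast-concatenation conditioning and all sizes are illustrative assumptions, not the paper's design.

```python
# Sketch: conditioning two conv stages on a text embedding (illustrative only).
import torch
import torch.nn as nn

class MultistageFusionSER(nn.Module):
    def __init__(self, n_mels=64, text_dim=128, num_classes=4):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels + text_dim, 128, kernel_size=5, padding=2)
        self.conv2 = nn.Conv1d(128 + text_dim, 128, kernel_size=5, padding=2)
        self.head = nn.Linear(128, num_classes)

    def _condition(self, feats, text_emb):
        # Broadcast the utterance-level text embedding over all time steps.
        text = text_emb.unsqueeze(-1).expand(-1, -1, feats.size(-1))
        return torch.cat([feats, text], dim=1)

    def forward(self, mel, text_emb):          # mel: (B, n_mels, T)
        x = torch.relu(self.conv1(self._condition(mel, text_emb)))
        x = torch.relu(self.conv2(self._condition(x, text_emb)))
        return self.head(x.mean(dim=-1))       # pool over time, then classify

model = MultistageFusionSER()
logits = model(torch.randn(2, 64, 300), torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 4])
```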
- CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations [20.239063010740853]
We present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language.
We observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification.
arXiv Detail & Related papers (2021-09-01T04:18:19Z)
- Pretrained Language Models for Dialogue Generation with Multiple Input Sources [101.17537614998805]
In this work, we study dialogue models with multiple input sources adapted from the pretrained language model GPT2.
We explore various methods to fuse multiple separate attention information corresponding to different sources.
Our experimental results show that proper fusion methods deliver higher relevance with dialogue history than simple fusion baselines.
arXiv Detail & Related papers (2020-10-15T07:53:28Z)
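As a loose illustration of fusing attention information from multiple input sources, the toy sketch below combines cross-attention outputs over separate source encodings with learned softmax weights; this is not the paper's GPT2-based method, and all module names and sizes are made up.

```python
# Toy sketch: learned weighted fusion of attention over multiple input sources.
import torch
import torch.nn as nn

class MultiSourceFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_sources=2):
        super().__init__()
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(num_sources)]
        )
        self.source_logits = nn.Parameter(torch.zeros(num_sources))

    def forward(self, decoder_states, sources):
        # One cross-attention per source (e.g. dialogue history, persona).
        outs = [attn(decoder_states, src, src)[0]
                for attn, src in zip(self.attns, sources)]
        weights = torch.softmax(self.source_logits, dim=0)  # learned mixing weights
        return sum(w * o for w, o in zip(weights, outs))

fusion = MultiSourceFusion()
dec = torch.randn(2, 20, 256)                 # decoder hidden states
history = torch.randn(2, 50, 256)             # encoded dialogue history
persona = torch.randn(2, 10, 256)             # encoded persona/knowledge
print(fusion(dec, [history, persona]).shape)  # torch.Size([2, 20, 256])
```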
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)