Sentiment analysis in non-fixed length audios using a Fully
Convolutional Neural Network
- URL: http://arxiv.org/abs/2402.02184v1
- Date: Sat, 3 Feb 2024 15:26:28 GMT
- Title: Sentiment analysis in non-fixed length audios using a Fully
Convolutional Neural Network
- Authors: María Teresa García-Ordás, Héctor Alaiz-Moretón, José Alberto Benítez-Andrades, Isaías García-Rodríguez, Oscar García-Olalla and Carmen Benavides
- Abstract summary: A sentiment analysis method that is capable of accepting audio of any length, without being fixed a priori, is proposed.
Mel spectrogram and Mel Frequency Cepstral Coefficients are used as audio description methods.
A Fully Convolutional Neural Network architecture is proposed as a classifier.
- Score: 0.3495246564946556
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this work, a sentiment analysis method that is capable of accepting audio
of any length, without being fixed a priori, is proposed. Mel spectrogram and
Mel Frequency Cepstral Coefficients are used as audio description methods and a
Fully Convolutional Neural Network architecture is proposed as a classifier.
The results have been validated using three well-known datasets: EMODB,
RAVDESS, and TESS. The results obtained were promising, outperforming
state-of-the-art methods. Moreover, because the proposed method accepts audio
of any length, it allows sentiment analysis to be performed in near real time,
which is very interesting for a wide range of fields such as call centers,
medical consultations, and financial brokerage.
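To make the recipe concrete, here is a minimal sketch of the two pieces the abstract names: Mel spectrogram / MFCC description and a fully convolutional classifier. This is illustrative only, not the authors' implementation; the helper name, layer sizes, and pooling choice are assumptions. The key property is that the only reduction to a fixed size is a global pooling layer, so the number of time frames T is unconstrained.

```python
# Illustrative sketch (not the authors' code): Mel spectrogram / MFCC
# description plus a fully convolutional classifier. Global pooling is the
# only reduction to a fixed size, so the time axis T is unconstrained.
import librosa
import torch.nn as nn

def describe(path, n_mels=64, n_mfcc=20):
    """Audio description for a file of any length (assumed helper)."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mel, mfcc  # shapes (n_mels, T) and (n_mfcc, T); T grows with length

class FCNClassifier(nn.Module):
    """Fully convolutional: no Linear layer tied to a fixed input size."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, n_classes, 1),      # 1x1 conv as classification head
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapses any (freq, time) size

    def forward(self, x):                     # x: (batch, 1, n_mels, T)
        return self.pool(self.features(x)).flatten(1)   # (batch, n_classes)
```

Because no layer is tied to a fixed input length, an utterance can be described and classified as soon as it ends (or chunk by chunk), which is what makes the near-real-time use case plausible.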
Related papers
- Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
In time-domain single-channel speech enhancement (SE), extracting the target speaker without prior information remains challenging under multi-talker conditions.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
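The title names convolutional cross attention as the EEG-audio fusion mechanism. The sketch below illustrates generic cross attention between the two feature streams; the dimensions and the use of nn.MultiheadAttention are assumptions, not BASEN's actual blocks.

```python
# Generic cross-attention sketch between audio and EEG feature sequences
# (illustrates the mechanism named in the title, not BASEN's architecture).
import torch
import torch.nn as nn

class AudioEEGCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats, eeg_feats):
        # Audio queries attend over EEG keys/values, injecting the listener's
        # attended-speaker cues into the enhancement stream.
        fused, _ = self.attn(audio_feats, eeg_feats, eeg_feats)
        return fused

# e.g. audio (batch=2, frames=100, dim=128), EEG (batch=2, steps=50, dim=128)
fused = AudioEEGCrossAttention()(torch.randn(2, 100, 128), torch.randn(2, 50, 128))
```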
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- A Study on Robustness to Perturbations for Representations of Environmental Sound [16.361059909912758]
We evaluate two embeddings -- YAMNet and OpenL3 -- on monophonic (UrbanSound8K) and polyphonic (SONYC UST) datasets.
We imitate channel effects by injecting perturbations to the audio signal and measure the shift in the new embeddings with three distance measures.
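The evaluation protocol is simple to state: perturb the waveform, re-embed, and measure how far the embedding moved. Below is a sketch of that loop, assuming a caller-supplied `embed` function wrapping YAMNet or OpenL3, and white noise as one stand-in perturbation (the study covers more channel effects and three distance measures).

```python
# Sketch of the robustness protocol (illustrative, not the paper's code):
# perturb the waveform, re-embed, and measure the shift in embedding space.
import numpy as np

def add_white_noise(y, snr_db=20.0):
    """Inject Gaussian noise at a target SNR -- one stand-in for the
    channel effects the study simulates."""
    noise_power = np.mean(y ** 2) / (10 ** (snr_db / 10))
    return y + np.random.randn(len(y)) * np.sqrt(noise_power)

def embedding_shift(embed, y, sr):
    """`embed` is any (waveform, sr) -> 1-D vector model, e.g. a wrapper
    around YAMNet or OpenL3; returns cosine and L2 distances."""
    e_clean, e_noisy = embed(y, sr), embed(add_white_noise(y), sr)
    cos = 1 - np.dot(e_clean, e_noisy) / (
        np.linalg.norm(e_clean) * np.linalg.norm(e_noisy))
    return cos, np.linalg.norm(e_clean - e_noisy)
```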
arXiv Detail & Related papers (2022-03-20T01:04:38Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
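In its simplest form, dynamic stream weighting is a convex, per-frame combination of the two modalities' scores; the paper's contribution is extending such weights to specific spatial regions. The toy function below (names and shapes are assumptions) shows only the basic idea.

```python
# Toy illustration of dynamic stream weighting: a convex, per-time-step
# combination of audio and video localization scores. The paper extends such
# weights to be region-specific; this sketch uses one weight per frame.
import numpy as np

def fuse_streams(audio_scores, video_scores, weights):
    """audio_scores, video_scores: (T, n_candidates) per-frame scores;
    weights: (T, 1) values in [0, 1], e.g. per-frame stream reliability."""
    return weights * audio_scores + (1.0 - weights) * video_scores
```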
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
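That transfer-learning recipe, freezing low-level layers and adapting only high-level ones, might look like the sketch below, with a ResNet-18 standing in for the paper's VoxCeleb-pretrained CNN and an assumed choice of layers to unfreeze.

```python
# Sketch of partial fine-tuning (a ResNet-18 stands in for the paper's
# VoxCeleb-pretrained speaker CNN; which layers to unfreeze is an assumption).
import torchvision.models as models

model = models.resnet18()  # assume VoxCeleb-pretrained weights were loaded
for name, param in model.named_parameters():
    # Keep low-level feature extractors frozen; adapt only the high-level
    # block and the head to clean CRSS-Forensics speech.
    param.requires_grad = name.startswith(("layer4", "fc"))
```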
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- An Ensemble of Convolutional Neural Networks for Audio Classification [9.174145063580882]
Ensembles of CNNs for audio classification are presented and tested on three freely available audio classification datasets.
To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification.
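A minimal sketch of score-level ensembling follows, assuming softmax averaging as the fusion rule (the paper evaluates many CNN combinations and fusion variants).

```python
# Minimal score-level ensembling sketch (softmax averaging is an assumed
# fusion rule; the paper studies several CNN combinations).
import torch

@torch.no_grad()
def ensemble_predict(models, x):
    """Average class probabilities of several trained CNNs over input x."""
    probs = [m(x).softmax(dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)
```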
arXiv Detail & Related papers (2020-07-15T19:41:15Z)
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
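One generic way to write such audio-tag alignment is a contrastive objective over paired embeddings; the sketch below is illustrative and is not COALA's exact co-aligned autoencoder loss.

```python
# Generic contrastive alignment of audio and tag embeddings (illustrative;
# COALA's co-aligned autoencoder objective differs in detail).
import torch
import torch.nn.functional as F

def alignment_loss(audio_z, tag_z, temperature=0.1):
    """audio_z, tag_z: (batch, dim) latent codes from the two encoders;
    matching pairs share a row index."""
    audio_z = F.normalize(audio_z, dim=-1)
    tag_z = F.normalize(tag_z, dim=-1)
    logits = audio_z @ tag_z.t() / temperature        # pairwise similarities
    targets = torch.arange(audio_z.size(0), device=audio_z.device)
    return F.cross_entropy(logits, targets)           # i-th audio <-> i-th tag
```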
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the quality degradation of the system for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
Approximately 3 million people with hearing loss cannot perceive the events happening around them.
This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)