Sentiment analysis in non-fixed length audios using a Fully
Convolutional Neural Network
- URL: http://arxiv.org/abs/2402.02184v1
- Date: Sat, 3 Feb 2024 15:26:28 GMT
- Title: Sentiment analysis in non-fixed length audios using a Fully
Convolutional Neural Network
- Authors: María Teresa García-Ordás, Héctor Alaiz-Moretón, José Alberto Benítez-Andrades, Isaías García-Rodríguez, Oscar García-Olalla and Carmen Benavides
- Abstract summary: A sentiment analysis method that is capable of accepting audio of any length, without being fixed a priori, is proposed.
Mel spectrogram and Mel Frequency Cepstral Coefficients are used as audio description methods.
A Fully Convolutional Neural Network architecture is proposed as a classifier.
- Score: 0.3495246564946556
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this work, a sentiment analysis method that is capable of accepting audio
of any length, without being fixed a priori, is proposed. Mel spectrogram and
Mel Frequency Cepstral Coefficients are used as audio description methods and a
Fully Convolutional Neural Network architecture is proposed as a classifier.
The results have been validated using three well-known datasets: EMODB,
RAVDESS, and TESS. The results obtained were promising, outperforming
state-of-the-art methods. Moreover, because the proposed method accepts audio
of any length, it allows sentiment analysis to be performed in near real time,
which is very interesting for a wide range of fields such as call centers,
medical consultations, and financial brokerage.
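To make the recipe concrete, here is a minimal sketch of the two pieces the abstract names: Mel spectrogram / MFCC description and a fully convolutional classifier. This is illustrative only, not the authors' implementation; the helper name, layer sizes, and pooling choice are assumptions. The key property is that the only reduction to a fixed size is a global pooling layer, so the number of time frames T is unconstrained.

```python
# Illustrative sketch (not the authors' code): Mel spectrogram / MFCC
# description plus a fully convolutional classifier. Global pooling is the
# only reduction to a fixed size, so the time axis T is unconstrained.
import librosa
import torch.nn as nn

def describe(path, n_mels=64, n_mfcc=20):
    """Audio description for a file of any length (assumed helper)."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mel, mfcc  # shapes (n_mels, T) and (n_mfcc, T); T grows with length

class FCNClassifier(nn.Module):
    """Fully convolutional: no Linear layer tied to a fixed input size."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, n_classes, 1),      # 1x1 conv as classification head
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapses any (freq, time) size

    def forward(self, x):                     # x: (batch, 1, n_mels, T)
        return self.pool(self.features(x)).flatten(1)   # (batch, n_classes)
```

Because no layer is tied to a fixed input length, an utterance can be described and classified as soon as it ends (or chunk by chunk), which is what makes the near-real-time use case plausible.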
Related papers
- Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
In time-domain single-channel speech enhancement (SE), extracting the target speaker without prior information remains challenging under multi-talker conditions.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
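The title names convolutional cross attention as the EEG-audio fusion mechanism. The sketch below illustrates generic cross attention between the two feature streams; the dimensions and the use of nn.MultiheadAttention are assumptions, not BASEN's actual blocks.

```python
# Generic cross-attention sketch between audio and EEG feature sequences
# (illustrates the mechanism named in the title, not BASEN's architecture).
import torch
import torch.nn as nn

class AudioEEGCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats, eeg_feats):
        # Audio queries attend over EEG keys/values, injecting the listener's
        # attended-speaker cues into the enhancement stream.
        fused, _ = self.attn(audio_feats, eeg_feats, eeg_feats)
        return fused

# e.g. audio (batch=2, frames=100, dim=128), EEG (batch=2, steps=50, dim=128)
fused = AudioEEGCrossAttention()(torch.randn(2, 100, 128), torch.randn(2, 50, 128))
```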
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- A Study on Robustness to Perturbations for Representations of Environmental Sound [16.361059909912758]
We evaluate two embeddings -- YAMNet and OpenL3 -- on monophonic (UrbanSound8K) and polyphonic (SONYC UST) datasets.
We imitate channel effects by injecting perturbations to the audio signal and measure the shift in the new embeddings with three distance measures.
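The evaluation protocol is simple to state: perturb the waveform, re-embed, and measure how far the embedding moved. Below is a sketch of that loop, assuming a caller-supplied `embed` function wrapping YAMNet or OpenL3, and white noise as one stand-in perturbation (the study covers more channel effects and three distance measures).

```python
# Sketch of the robustness protocol (illustrative, not the paper's code):
# perturb the waveform, re-embed, and measure the shift in embedding space.
import numpy as np

def add_white_noise(y, snr_db=20.0):
    """Inject Gaussian noise at a target SNR -- one stand-in for the
    channel effects the study simulates."""
    noise_power = np.mean(y ** 2) / (10 ** (snr_db / 10))
    return y + np.random.randn(len(y)) * np.sqrt(noise_power)

def embedding_shift(embed, y, sr):
    """`embed` is any (waveform, sr) -> 1-D vector model, e.g. a wrapper
    around YAMNet or OpenL3; returns cosine and L2 distances."""
    e_clean, e_noisy = embed(y, sr), embed(add_white_noise(y), sr)
    cos = 1 - np.dot(e_clean, e_noisy) / (
        np.linalg.norm(e_clean) * np.linalg.norm(e_noisy))
    return cos, np.linalg.norm(e_clean - e_noisy)
```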
arXiv Detail & Related papers (2022-03-20T01:04:38Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
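In its simplest form, dynamic stream weighting is a convex, per-frame combination of the two modalities' scores; the paper's contribution is extending such weights to specific spatial regions. The toy function below (names and shapes are assumptions) shows only the basic idea.

```python
# Toy illustration of dynamic stream weighting: a convex, per-time-step
# combination of audio and video localization scores. The paper extends such
# weights to be region-specific; this sketch uses one weight per frame.
import numpy as np

def fuse_streams(audio_scores, video_scores, weights):
    """audio_scores, video_scores: (T, n_candidates) per-frame scores;
    weights: (T, 1) values in [0, 1], e.g. per-frame stream reliability."""
    return weights * audio_scores + (1.0 - weights) * video_scores
```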
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
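That transfer-learning recipe, freezing low-level layers and adapting only high-level ones, might look like the sketch below, with a ResNet-18 standing in for the paper's VoxCeleb-pretrained CNN and an assumed choice of layers to unfreeze.

```python
# Sketch of partial fine-tuning (a ResNet-18 stands in for the paper's
# VoxCeleb-pretrained speaker CNN; which layers to unfreeze is an assumption).
import torchvision.models as models

model = models.resnet18()  # assume VoxCeleb-pretrained weights were loaded
for name, param in model.named_parameters():
    # Keep low-level feature extractors frozen; adapt only the high-level
    # block and the head to clean CRSS-Forensics speech.
    param.requires_grad = name.startswith(("layer4", "fc"))
```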
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- An Ensemble of Convolutional Neural Networks for Audio Classification [9.174145063580882]
Ensembles of CNNs for audio classification are presented and tested on three freely available audio classification datasets.
To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification.
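A minimal sketch of score-level ensembling follows, assuming softmax averaging as the fusion rule (the paper evaluates many CNN combinations and fusion variants).

```python
# Minimal score-level ensembling sketch (softmax averaging is an assumed
# fusion rule; the paper studies several CNN combinations).
import torch

@torch.no_grad()
def ensemble_predict(models, x):
    """Average class probabilities of several trained CNNs over input x."""
    probs = [m(x).softmax(dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)
```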
arXiv Detail & Related papers (2020-07-15T19:41:15Z)
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
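One generic way to write such audio-tag alignment is a contrastive objective over paired embeddings; the sketch below is illustrative and is not COALA's exact co-aligned autoencoder loss.

```python
# Generic contrastive alignment of audio and tag embeddings (illustrative;
# COALA's co-aligned autoencoder objective differs in detail).
import torch
import torch.nn.functional as F

def alignment_loss(audio_z, tag_z, temperature=0.1):
    """audio_z, tag_z: (batch, dim) latent codes from the two encoders;
    matching pairs share a row index."""
    audio_z = F.normalize(audio_z, dim=-1)
    tag_z = F.normalize(tag_z, dim=-1)
    logits = audio_z @ tag_z.t() / temperature        # pairwise similarities
    targets = torch.arange(audio_z.size(0), device=audio_z.device)
    return F.cross_entropy(logits, targets)           # i-th audio <-> i-th tag
```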
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the quality degradation of the system for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
Approximately 3 million people with hearing loss cannot perceive the events happening around them.
This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)