Dynamic Behaviour of Connectionist Speech Recognition with Strong
Latency Constraints
- URL: http://arxiv.org/abs/2401.06588v1
- Date: Fri, 12 Jan 2024 14:10:28 GMT
- Title: Dynamic Behaviour of Connectionist Speech Recognition with Strong
Latency Constraints
- Authors: Giampiero Salvi
- Abstract summary: This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints.
The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal.
- Score: 6.5458610824731664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the use of connectionist techniques in phonetic speech
recognition with strong latency constraints. The constraints are imposed by the
task of deriving the lip movements of a synthetic face in real time from the
speech signal, by feeding the phonetic string into an articulatory synthesiser.
Particular attention has been paid to analysing the interaction between the
time evolution model learnt by the multi-layer perceptrons and the transition
model imposed by the Viterbi decoder, in different latency conditions. Two
experiments were conducted in which the time dependencies in the language model
(LM) were controlled by a parameter. The results show a strong interaction
between the three factors involved, namely the neural network topology, the
length of time dependencies in the LM and the decoder latency.
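The pipeline the abstract describes, frame-level phone probabilities from a network combined with a transition model in a decoder that may only delay its output by a bounded number of frames, can be illustrated with a truncated-traceback Viterbi pass. The sketch below is not the paper's implementation; the function name, the posterior and transition matrices, and the `latency` parameter are assumptions made for the example.

```python
import numpy as np

def truncated_viterbi(log_posteriors, log_trans, latency):
    """Frame-synchronous Viterbi decoding with a bounded output delay.

    log_posteriors : (T, N) array of log phone probabilities per frame
                     (e.g. network outputs).
    log_trans      : (N, N) array of log transition probabilities
                     (the transition model / "LM").
    latency        : allowed output delay in frames (>= 1).
    Returns one decoded state index per frame.
    """
    T, N = log_posteriors.shape
    delta = log_posteriors[0].copy()        # best path score ending in each state
    backptr = np.zeros((T, N), dtype=int)   # best predecessor of each state, per frame
    output = []

    for t in range(1, T):
        # Extend every partial path by every allowed transition.
        scores = delta[:, None] + log_trans            # (prev_state, next_state)
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(N)] + log_posteriors[t]

        # Commit the decision for frame t - latency by partial traceback
        # from the currently best hypothesis.
        if t >= latency:
            state = int(np.argmax(delta))
            for k in range(t, t - latency, -1):
                state = backptr[k, state]
            output.append(state)

    # Flush the last `latency` frames with a final traceback.
    state = int(np.argmax(delta))
    tail = [state]
    for k in range(T - 1, max(T - latency, 0), -1):
        state = backptr[k, state]
        tail.append(state)
    output.extend(reversed(tail))
    return output
```

With a small latency the decoder must commit almost immediately, leaving the transition model little room to revise the network's frame-level decisions; with a larger latency longer time dependencies in the LM can take effect. This trade-off between network topology, LM time dependencies and decoder latency is what the paper's experiments vary.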
Related papers
- What does it take to get state of the art in simultaneous speech-to-speech translation? [0.0]
We study the latency characteristics observed in the performance of simultaneous speech-to-speech translation models.
We propose methods to minimize latency spikes and improve overall performance.
arXiv Detail & Related papers (2024-09-02T06:04:07Z) - Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning [22.54577327204281]
Multimodal sentiment analysis aims to learn representations from different modalities to identify human emotions.
Existing works often neglect the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise.
We are the first to propose temporal-invariant learning, which constrains the distributional variations over time steps to effectively capture long-term temporal dynamics.
arXiv Detail & Related papers (2024-08-30T03:28:40Z) - Capturing Spectral and Long-term Contextual Information for Speech
Emotion Recognition Using Deep Learning Techniques [0.0]
This research proposes an ensemble model that combines Graph Convolutional Networks (GCN) for processing textual data and the HuBERT transformer for analyzing audio signals.
By combining GCN and HuBERT, our ensemble model can leverage the strengths of both approaches.
Results indicate that the combined model can overcome the limitations of traditional methods, leading to enhanced accuracy in recognizing emotions from speech.
arXiv Detail & Related papers (2023-08-04T06:20:42Z) - Intensity Profile Projection: A Framework for Continuous-Time
Representation Learning for Dynamic Networks [50.2033914945157]
We present a representation learning framework, Intensity Profile Projection, for continuous-time dynamic network data.
The framework consists of three stages, including estimating pairwise intensity functions and learning a projection which minimises a notion of intensity reconstruction error.
Moreover, we develop estimation theory providing tight control on the error of any estimated trajectory, indicating that the representations could even be used in quite noise-sensitive follow-on analyses.
arXiv Detail & Related papers (2023-06-09T15:38:25Z) - Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual
Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z) - Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic
Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z) - Attention and Encoder-Decoder based models for transforming articulatory
movements at different speaking rates [60.02121449986413]
We propose an encoder-decoder architecture using LSTMs which generates smoother predicted articulatory trajectories.
We analyze the amplitude of the transformed articulatory movements at different rates compared to their original counterparts.
We observe that AstNet could model both duration and extent of articulatory movements better than the existing transformation techniques.
arXiv Detail & Related papers (2020-06-04T19:33:26Z) - Quaternion Neural Networks for Multi-channel Distant Speech Recognition [25.214316268077244]
A common approach to mitigating the difficulty of distant speech recognition consists of equipping the recording devices with multiple microphones.
We propose to capture these inter- and intra- structural dependencies with quaternion neural networks.
We show that a quaternion long short-term memory neural network (QLSTM), trained on the multi-channel speech signals, outperforms an equivalent real-valued LSTM on two different tasks of distant speech recognition.
arXiv Detail & Related papers (2020-05-18T10:26:27Z) - Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio
Signals [7.219077740523682]
We introduce the multi-time-scale (MTS) method to create flexibility towards temporal variations when analyzing audio data.
We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using 4 datasets of different sizes.
arXiv Detail & Related papers (2020-03-06T12:28:04Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)