Short-Term Memory Convolutions
- URL: http://arxiv.org/abs/2302.04331v1
- Date: Wed, 8 Feb 2023 20:52:24 GMT
- Title: Short-Term Memory Convolutions
- Authors: Grzegorz Stefański, Krzysztof Arendt, Paweł Daniluk,
  Bartłomiej Jasik, Artur Szumaczuk
- Abstract summary: We propose a novel method for minimizing inference-time latency and memory consumption, called Short-Term Memory Convolution (STMC).
Training of STMC-based models is faster and more stable, as the method is based solely on convolutional neural networks (CNNs).
In the case of speech separation, we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting output quality.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The real-time processing of time series signals is a critical issue for many
real-life applications. Real-time processing is especially important in the
audio domain, as human perception of sound is sensitive to any kind of
disturbance in perceived signals, especially the lag between the auditory and
visual modalities. The rise of deep learning (DL) models has complicated the
landscape of signal processing. Although they often offer superior quality
compared to standard DSP methods, this advantage is diminished by higher
latency. In this work we propose a novel method for minimizing inference-time
latency and memory consumption, called Short-Term Memory Convolution (STMC),
along with its transposed counterpart. The main advantage of STMC is low
latency, comparable to that of long short-term memory (LSTM) networks.
Furthermore, the training of STMC-based models is faster and more stable, as
the method is based solely on convolutional neural networks (CNNs). In this
study we demonstrate an application of this solution to a U-Net model for a
speech separation task and to a GhostNet model in an acoustic scene
classification (ASC) task. In the case of speech separation we achieved a
5-fold reduction in inference time and a 2-fold reduction in latency without
affecting the output quality. The inference time for the ASC task was up to 4
times faster while preserving the original accuracy.
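The abstract describes the caching idea only at a high level. The sketch below is not the authors' implementation; it merely illustrates, under simplifying assumptions, how a 1-D convolution can be streamed by keeping the last few input samples as state between calls, so each incoming chunk is processed without recomputing past samples. The names `StreamingConv1d` and `process` are illustrative.

```python
import numpy as np

def conv1d_valid(x, k):
    """Plain 'valid'-mode 1-D convolution (cross-correlation, no kernel flip)."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

class StreamingConv1d:
    """Illustrative sketch: cache the last (kernel_size - 1) input samples
    between calls, so chunk-by-chunk inference matches one full causal pass."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # "short-term memory": tail of the previous input, zero-initialized
        self.buffer = np.zeros(len(self.kernel) - 1)

    def process(self, chunk):
        # prepend cached context, convolve, then keep the new tail as state
        x = np.concatenate([self.buffer, np.asarray(chunk, dtype=float)])
        y = conv1d_valid(x, self.kernel)
        self.buffer = x[len(x) - (len(self.kernel) - 1):]
        return y
```

With this state buffer, processing a signal chunk by chunk yields exactly the same output as a single causal pass over the zero-padded signal while touching each sample only once; STMC extends this caching principle across the layers of a deep CNN, which is what removes the per-chunk recomputation that otherwise inflates latency.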
Related papers
- OFDM-Standard Compatible SC-NOFS Waveforms for Low-Latency and Jitter-Tolerance Industrial IoT Communications [53.398544571833135]
This work proposes a spectrally efficient irregular Sinc (irSinc) shaping technique, revisiting the traditional Sinc pulse, which dates back to 1924.
irSinc yields a signal with increased spectral efficiency without sacrificing error performance.
Our signal achieves faster data transmission within the same spectral bandwidth through 5G standard signal configuration.
arXiv Detail & Related papers (2024-06-07T09:20:30Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement [16.701596804113553]
We explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Network with Time-Frequency Attention (MNTFA) for speech enhancement.
Compared to DPRNN, axial self-attention greatly reduces the need for memory and computation.
We propose a joint training method of a multi-resolution STFT loss and a WavLM loss using a pre-trained WavLM network.
arXiv Detail & Related papers (2023-06-15T08:48:19Z)
- A low latency attention module for streaming self-supervised speech representation learning [0.4288177321445912]
Self-supervised speech representation learning (SSRL) is a popular use-case for the transformer architecture.
We present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements.
Our implementation also reduces the inference latency from 1.92 to 0.16 seconds.
arXiv Detail & Related papers (2023-02-27T00:44:22Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating point operations by more than half for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- Ultra-low Latency Spiking Neural Networks with Spatio-Temporal Compression and Synaptic Convolutional Block [4.081968050250324]
Spiking neural networks (SNNs) offer spatio-temporal information processing capability, low power consumption, and high biological plausibility.
Event-stream datasets such as N-MNIST, CIFAR10-DVS, and DVS128 Gesture require aggregating individual events into frames with a higher temporal resolution for classification.
We propose a spatio-temporal compression method that aggregates individual events into a few time steps of synaptic current to reduce training and inference latency.
arXiv Detail & Related papers (2022-03-18T15:14:13Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of other biologically plausible neuromorphic approaches to sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR [44.229256049718316]
Streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity.
In these models, the decisions to generate tokens are delayed compared to the actual acoustic boundaries since their unidirectional encoders lack future information.
We propose several strategies during training by leveraging external hard alignments extracted from the hybrid model.
Experiments on the Cortana voice search task demonstrate that our proposed methods can significantly reduce the latency, and even improve the recognition accuracy in certain cases on the decoder side.
arXiv Detail & Related papers (2020-04-10T12:24:49Z)
- A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.