Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization
- URL: http://arxiv.org/abs/2108.04692v1
- Date: Tue, 10 Aug 2021 13:49:41 GMT
- Title: Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization
- Authors: Andrew Koh, Fuzhao Xue, Eng Siong Chng
- Abstract summary: We propose an architecture that better leverages the acoustic features provided by PANNs for the Automated Audio Captioning task.
We also introduce a novel self-supervised objective, Reconstruction Latent Space Similarity Regularization (RLSSR).
- Score: 21.216783537997426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we examine the use of Transfer Learning using Pretrained Audio
Neural Networks (PANNs), and propose an architecture that is able to better
leverage the acoustic features provided by PANNs for the Automated Audio
Captioning Task. We also introduce a novel self-supervised objective,
Reconstruction Latent Space Similarity Regularization (RLSSR). The RLSSR module
supplements the training of the model by minimizing the similarity between the
encoder and decoder embedding. The combination of both methods allows us to
surpass state of the art results by a significant margin on the Clotho dataset
across several metrics and benchmarks.
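
The abstract describes RLSSR only at a high level, so the following is a minimal sketch of how a similarity-based regularizer between an encoder embedding and a decoder embedding might be added to a captioning loss. Everything in it is an assumption made for illustration: mean pooling over time, cosine similarity as the measure, the penalty `1 - cosine`, matching embedding dimensions, and the names `mean_pool`, `similarity_regularizer`, `total_loss`, and `weight`. It is not the authors' implementation.

```python
import math


def mean_pool(frames):
    """Average a sequence of equal-length vectors into a single vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity of two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def similarity_regularizer(encoder_frames, decoder_states):
    """Assumed penalty: 0 when the pooled embeddings align, up to 2 when opposed.

    Assumes both embeddings already share the same dimension; a real model
    would likely need a learned projection between the two spaces.
    """
    enc = mean_pool(encoder_frames)
    dec = mean_pool(decoder_states)
    return 1.0 - cosine_similarity(enc, dec)


def total_loss(caption_loss, encoder_frames, decoder_states, weight=0.1):
    """Captioning loss plus the weighted regularizer (weight is hypothetical)."""
    return caption_loss + weight * similarity_regularizer(encoder_frames, decoder_states)
```

In this sketch the auxiliary term needs no extra labels, which is the sense in which such an objective is self-supervised: it is computed entirely from the model's own intermediate representations.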
 
      
Related papers
        - Variational Self-Supervised Learning [0.0]
 We present Variational Self-Supervised Learning (VSSL), a novel framework that combines variational inference with self-supervised learning.
A momentum-updated teacher network defines a dynamic, data-dependent prior, while the student encoder produces an approximate posterior from augmented views.
 Experiments on CIFAR-10, CIFAR-100, and ImageNet-100 show that VSSL achieves competitive or superior performance to leading self-supervised methods.
 arXiv (2025-04-06T01:28:50Z)
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
 This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
 arXiv (2024-11-04T16:46:53Z)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
 We present a comprehensive analysis on building ASR systems with discrete codes.
We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings.
We introduce a pipeline that outperforms Encodec at similar bit-rate.
 arXiv (2024-07-03T20:51:41Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
 We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
 arXiv (2022-12-02T18:58:51Z)
- CTA-RNN: Channel and Temporal-wise Attention RNN Leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition [20.02248459288662]
 We propose a novel channel and temporal-wise attention RNN architecture based on the intermediate representations of pre-trained ASR models.
We evaluate our approach on two popular benchmark datasets, IEMOCAP and MSP-IMPROV.
 arXiv (2022-03-31T13:32:51Z)
- A Mixture of Expert Based Deep Neural Network for Improved ASR [4.993304210475779]
 MixNet is a novel deep learning architecture for acoustic modeling in the context of Automatic Speech Recognition (ASR).
In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class mis-classification.
Experiments on a large-vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate.
 arXiv (2021-12-02T07:26:34Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
 We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
 arXiv (2021-10-08T05:07:35Z)
- Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
 We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
 arXiv (2021-04-09T11:04:58Z)
- PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning [109.84770951839289]
 We present PredRNN, a new recurrent network for learning visual dynamics from historical context.
We show that our approach obtains highly competitive results on three standard datasets.
 arXiv (2021-03-17T08:28:30Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
 We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
 arXiv (2021-02-09T08:19:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     