Modeling speech recognition and synthesis simultaneously: Encoding and
  decoding lexical and sublexical semantic information into speech with no
  direct access to speech data
- URL: http://arxiv.org/abs/2203.11476v1
- Date: Tue, 22 Mar 2022 06:04:34 GMT
- Authors: Gašper Beguš, Alan Zhou
- Abstract summary: We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: an unsupervised network that must learn to assign unique representations for lexical items.
Strong evidence in favor of lexical learning emerges.
The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data in an unsupervised manner without ever accessing real training data.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human speakers encode information into raw speech, which is then decoded by
the listeners. This complex relationship between encoding (production) and
decoding (perception) is often modeled separately. Here, we test how decoding
of lexical and sublexical semantic information can emerge automatically from
raw speech in unsupervised generative deep convolutional networks that combine
both the production and perception principles. We introduce, to our knowledge,
the most challenging objective in unsupervised lexical learning: an
unsupervised network that must learn to assign unique representations for
lexical items with no direct access to training data. We train several models
(ciwGAN and fiwGAN by [1]) and test how the networks classify raw acoustic
lexical items in the unobserved test data. Strong evidence in favor of lexical
learning emerges. The architecture that combines the production and perception
principles is thus able to learn to decode unique information from raw acoustic
data in an unsupervised manner without ever accessing real training data. We
propose a technique to explore lexical and sublexical learned representations
in the classifier network. The results bear implications for both unsupervised
speech synthesis and recognition as well as for unsupervised semantic modeling
as language models increasingly bypass text and operate from raw acoustics.
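
The latent-code idea behind the ciwGAN and fiwGAN models named in the abstract can be illustrated with a small sketch. It assumes that ciwGAN pairs a one-hot categorical code with uniform noise, and that fiwGAN uses a binary (featural) code instead, so that n bits can distinguish 2^n lexical classes. All function names, dimensions, and the 0.5 threshold are hypothetical choices for illustration; this is not the authors' implementation, and it involves no audio or neural networks. It only shows how a code could be written into a latent vector and read back from a classifier's output scores.

```python
import random


def ciwgan_latent(class_index, num_classes, noise_dim, rng):
    """Build a latent vector: one-hot class code followed by uniform noise.

    Assumption: noise is drawn uniformly from [-1, 1].
    """
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    code = [1.0 if i == class_index else 0.0 for i in range(num_classes)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(noise_dim)]
    return code + noise


def fiwgan_latent(class_index, num_bits, noise_dim, rng):
    """Build a latent vector: binary (featural) code followed by uniform noise.

    The class index is written most-significant bit first.
    """
    if not 0 <= class_index < 2 ** num_bits:
        raise ValueError("class_index out of range")
    code = [float((class_index >> b) & 1) for b in reversed(range(num_bits))]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(noise_dim)]
    return code + noise


def decode_ciwgan(q_output):
    """Read a class from classifier scores: index of the largest score."""
    return max(range(len(q_output)), key=lambda i: q_output[i])


def decode_fiwgan(q_output, threshold=0.5):
    """Read a class from classifier scores: threshold each bit, then
    interpret the bits as a binary number (most-significant bit first)."""
    index = 0
    for value in q_output:
        index = (index << 1) | (1 if value > threshold else 0)
    return index


rng = random.Random(0)
z_ciw = ciwgan_latent(class_index=2, num_classes=5, noise_dim=3, rng=rng)
z_fiw = fiwgan_latent(class_index=5, num_bits=3, noise_dim=2, rng=rng)
```

In this sketch, five classes need five code variables under the one-hot scheme, while three binary variables already cover eight classes, which is the trade-off the featural code is meant to illustrate.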
 
      
Related papers
- Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [23.059241057567956]
 This paper unifies two types of tokens and proposes UniCodec, a universal speech token learning approach that encapsulates all semantics of speech.
A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features.
 arXiv  Detail & Related papers  (2025-03-15T12:50:43Z)
- Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks [8.683116789109462]
 We focus on one of the most ubiquitous and elementary suboperations of syntax -- concatenation.
We introduce spontaneous concatenation: a phenomenon where convolutional neural networks (CNNs) trained on acoustic recordings of individual words start generating outputs with two or even three words concatenated.
We also propose a potential neural mechanism called disinhibition that outlines a possible neural pathway towards concatenation and compositionality.
 arXiv  Detail & Related papers  (2023-05-02T17:38:21Z)
- Introducing Semantics into Speech Encoders [91.37001512418111]
 We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves performance similar to that of supervised methods trained on over 100 hours of labeled audio transcripts.
 arXiv  Detail & Related papers  (2022-11-15T18:44:28Z)
- Bootstrapping meaning through listening: Unsupervised learning of spoken
  sentence embeddings [4.582129557845177]
 This study tackles the unsupervised learning of semantic representations for spoken utterances.
We propose WavEmbed, a sequential autoencoder that predicts hidden units from a dense representation of speech.
We also propose S-HuBERT to induce meaning through knowledge distillation.
 arXiv  Detail & Related papers  (2022-10-23T21:16:09Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
 Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
 arXiv  Detail & Related papers  (2022-05-21T16:52:57Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
  Languages [58.43299730989809]
 We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
 arXiv  Detail & Related papers  (2022-05-02T17:59:02Z)
- Knowledge Transfer from Large-scale Pretrained Language Models to
  End-to-end Speech Recognizers [13.372686722688325]
 Training of end-to-end speech recognizers always requires transcribed utterances.
This paper proposes a method for alleviating this issue by transferring knowledge from a language model neural network that can be pretrained with text-only data.
 arXiv  Detail & Related papers  (2022-02-16T07:02:24Z)
- data2vec: A General Framework for Self-supervised Learning in Speech,
  Vision and Language [85.9019051663368]
 data2vec is a framework that uses the same learning method for either speech, NLP or computer vision.
The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup.
 Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance.
 arXiv  Detail & Related papers  (2022-02-07T22:52:11Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
  for Low-Resource Speech Recognition [159.9312272042253]
 Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
 arXiv  Detail & Related papers  (2021-09-19T16:39:22Z)
- Generative Adversarial Phonology: Modeling unsupervised phonetic and
  phonological learning with neural networks [0.0]
 Training deep neural networks on well-understood dependencies in speech data can provide new insights into how they learn internal representations.
This paper argues that acquisition of speech can be modeled as a dependency between random space and generated speech data in the Generative Adversarial Network architecture.
We propose a methodology to uncover the network's internal representations that correspond to phonetic and phonological properties.
 arXiv  Detail & Related papers  (2020-06-06T20:31:23Z)
- CiwGAN and fiwGAN: Encoding information in acoustic data to model
  lexical learning with Generative Adversarial Networks [0.0]
 Lexical learning is modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs.
Networks trained on lexical items from TIMIT learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space.
We show that phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech.
 arXiv  Detail & Related papers  (2020-06-04T15:33:55Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
 We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
 arXiv  Detail & Related papers  (2020-01-13T18:56:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     