Audio-Visual Neural Syntax Acquisition
- URL: http://arxiv.org/abs/2310.07654v1
- Date: Wed, 11 Oct 2023 16:54:57 GMT
- Title: Audio-Visual Neural Syntax Acquisition
- Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel,
Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath,
Yang Zhang, Karen Livescu, James Glass
- Abstract summary: We study phrase structure induction from visually-grounded speech.
We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text.
- Score: 91.14892278795892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study phrase structure induction from visually-grounded speech. The core
idea is to first segment the speech waveform into sequences of word segments,
and subsequently induce phrase structure using the inferred segment-level
continuous representations. We present the Audio-Visual Neural Syntax Learner
(AV-NSL) that learns phrase structure by listening to audio and looking at
images, without ever being exposed to text. By training on paired images and
spoken captions, AV-NSL exhibits the capability to infer meaningful phrase
structures that are comparable to those derived by naturally-supervised text
parsers, for both English and German. Our findings extend prior work in
unsupervised language acquisition from speech and grounded grammar induction,
and present one approach to bridge the gap between the two topics.
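The abstract's two-step recipe lends itself to a compact illustration: pool frame-level speech features within hypothesized word segments, then build a binary phrase structure over the pooled segment embeddings. The sketch below is a minimal stand-in, not the AV-NSL implementation; the random features, hand-picked segment boundaries, dot-product similarity, and greedy merging are all assumptions.

```python
# Hedged sketch of "segment, then induce phrase structure" -- not the authors' model.
import numpy as np

def pool_segments(frames, boundaries):
    """Average frame-level features within each (start, end) word segment."""
    return np.stack([frames[s:e].mean(axis=0) for s, e in boundaries])

def induce_tree(segments):
    """Greedily merge the most similar adjacent constituents into a binary tree."""
    nodes = [(i,) for i in range(len(segments))]      # leaves = word segments
    vecs = [segments[i] for i in range(len(segments))]
    while len(nodes) > 1:
        sims = [float(vecs[i] @ vecs[i + 1]) for i in range(len(nodes) - 1)]
        i = int(np.argmax(sims))                      # most similar adjacent pair
        nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]   # merge into one constituent
        vecs[i:i + 2] = [(vecs[i] + vecs[i + 1]) / 2.0]
    return nodes[0]

rng = np.random.default_rng(0)
frames = rng.normal(size=(40, 16))                    # stand-in speech features
word_spans = [(0, 8), (8, 18), (18, 30), (30, 40)]    # assumed word segmentation
tree = induce_tree(pool_segments(frames, word_spans))
print(tree)                                           # nested tuples = induced binary tree
```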
Related papers
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning.
It achieves between 1.7 and 2.3 BLEU improvement above the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead (see the sketch after this list).
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [24.657722909094662]
We present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images.
We connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units.
We conduct experiments on the Flickr8k spoken caption dataset and a novel corpus of spoken audio captions collected for the popular MSCOCO dataset.
arXiv Detail & Related papers (2020-12-31T05:28:38Z)
- STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning [2.28438857884398]
We present a novel multi-modal deep neural network architecture that uses speech and text entanglement for learning spoken-word representations.
STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken-word.
Latent representations produced by our model were able to predict the target phonetic sequences with an accuracy of 89.47%.
arXiv Detail & Related papers (2020-11-23T13:29:16Z)
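Two entries above, direct speech-to-speech translation with discrete units and text-free image-to-speech synthesis, both rely on discrete speech units obtained by clustering self-supervised features. The sketch below illustrates only that discretization step under assumed inputs (random vectors standing in for frame-level features, an arbitrary cluster count of 50); it is not either paper's pipeline.

```python
# Hedged illustration of deriving discrete speech units by k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 256))    # stand-in for self-supervised frame features

# Learn a unit inventory by clustering frame-level features (assumed K=50).
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(features)
frame_units = kmeans.predict(features)    # one discrete unit id per frame

# Collapse consecutive repeats so the target becomes a shorter unit sequence.
units = [int(frame_units[0])]
for u in frame_units[1:]:
    if u != units[-1]:
        units.append(int(u))
print(len(units), units[:20])
```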