Are words equally surprising in audio and audio-visual comprehension?
- URL: http://arxiv.org/abs/2307.07277v1
- Date: Fri, 14 Jul 2023 11:17:37 GMT
- Title: Are words equally surprising in audio and audio-visual comprehension?
- Authors: Pranava Madhyastha, Ye Zhang, Gabriella Vigliocco
- Abstract summary: We compare the ERP signature (N400) associated with each word in audio-only and audio-visual presentations of the same verbal stimuli.
Our results indicate that cognitive effort differs significantly between multimodal and unimodal settings.
This highlights the significant impact of local lexical context on cognitive processing in a multimodal environment.
- Score: 13.914373331208774
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We report a controlled study investigating the effect of visual information
(i.e., seeing the speaker) on spoken language comprehension. We compare the ERP
signature (N400) associated with each word in audio-only and audio-visual
presentations of the same verbal stimuli. We assess the extent to which
surprisal measures (which quantify the predictability of words in their lexical
context) are generated on the basis of different types of language models
(specifically n-gram and Transformer models) that predict N400 responses for
each word. Our results indicate that cognitive effort differs significantly
between multimodal and unimodal settings. In addition, our findings suggest
that while Transformer-based models, which have access to a larger lexical
context, provide a better fit in the audio-only setting, 2-gram language models
are more effective in the multimodal setting. This highlights the significant
impact of local lexical context on cognitive processing in a multimodal
environment.
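To make the surprisal-based analysis concrete, below is a minimal sketch of how per-word surprisal could be computed from a smoothed 2-gram model and from a Transformer language model, and then related to per-word N400 amplitudes with a simple least-squares fit. The GPT-2 checkpoint, the toy training corpus, and the placeholder N400 array are illustrative assumptions only, not the authors' actual materials or statistical models (which may, for example, use mixed-effects regression).

```python
# Hypothetical sketch: per-word surprisal from a 2-gram model and a Transformer LM,
# then a simple fit of surprisal against (placeholder) N400 amplitudes.
# GPT-2, the toy corpus, and the `n400` array are assumptions, not the paper's pipeline.
import math
from collections import Counter

import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast


def bigram_surprisal(words, corpus_tokens, alpha=1.0):
    """Surprisal -log2 P(w_i | w_{i-1}) under an add-alpha smoothed 2-gram model."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)
    out = []
    for prev, word in zip(words, words[1:]):
        p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)
        out.append(-math.log2(p))
    return out  # one value per word, from the second word onward


def transformer_surprisal(text):
    """Per-token surprisal -log2 P(token | preceding context) from GPT-2."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Note: GPT-2 tokens are subwords; word-level surprisal would sum subword surprisals.
    return (-token_lp / math.log(2)).squeeze(0).tolist()


if __name__ == "__main__":
    stimulus = "the speaker described a picture of a cat"
    words = stimulus.split()
    corpus = stimulus.split()  # stand-in corpus; a real 2-gram model needs far more text
    s_bigram = bigram_surprisal(words, corpus)
    s_transformer = transformer_surprisal(stimulus)

    # Placeholder for measured per-word ERP amplitudes (one value per predicted word).
    n400 = np.random.randn(len(s_bigram))
    for name, s in [("2-gram", s_bigram), ("transformer", s_transformer[: len(n400)])]:
        X = np.column_stack([np.ones(len(n400)), np.array(s)])
        beta, *_ = np.linalg.lstsq(X, n400, rcond=None)
        r2 = 1 - np.sum((n400 - X @ beta) ** 2) / np.sum((n400 - n400.mean()) ** 2)
        print(f"{name}: R^2 = {r2:.3f}")
```

In this framing, the paper's central comparison amounts to asking which surprisal measure (local 2-gram vs. long-context Transformer) yields the better fit to the N400 data in each presentation condition.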
Related papers
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique for learning audio-language representations that are robust to linguistic variations.
arXiv Detail & Related papers (2024-10-21T20:55:33Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models with a single trained model.
arXiv Detail & Related papers (2024-01-18T08:46:02Z) - VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and
Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z) - Unify and Conquer: How Phonetic Feature Representation Affects Polyglot
Text-To-Speech (TTS) [3.57486761615991]
Unified representations consistently achieve better cross-lingual synthesis with respect to both naturalness and accent.
Separate representations tend to have an order of magnitude more tokens than unified ones, which may affect model capacity.
arXiv Detail & Related papers (2022-07-04T16:14:57Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation
Learning [2.28438857884398]
We present a novel multi-modal deep neural network architecture that uses speech and text entanglement for learning spoken-word representations.
STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken-word.
Latent representations produced by our model were able to predict the target phonetic sequences with an accuracy of 89.47%.
arXiv Detail & Related papers (2020-11-23T13:29:16Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.