Emergence of Shared Sensory-motor Graphical Language from Visual Input
- URL: http://arxiv.org/abs/2210.06468v1
- Date: Mon, 3 Oct 2022 17:11:18 GMT
- Title: Emergence of Shared Sensory-motor Graphical Language from Visual Input
- Authors: Yoann Lemesle, Tristan Karch, Romain Laroche, Clément Moulin-Frier,
Pierre-Yves Oudeyer
- Abstract summary: We introduce the Graphical Referential Game (GREG) where a speaker must produce a graphical utterance to name a visual referent object.
The utterances are drawing images produced using dynamical motor primitives combined with a sketching library.
We show that our method allows the emergence of a shared, graphical language with compositional properties.
- Score: 22.23299485364174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The framework of Language Games studies the emergence of languages in
populations of agents. Recent contributions relying on deep learning methods
focused on agents communicating via an idealized communication channel, where
utterances produced by a speaker are directly perceived by a listener. This
comes in contrast with human communication, which instead relies on a
sensory-motor channel, where motor commands produced by the speaker (e.g. vocal
or gestural articulators) result in sensory effects perceived by the listener
(e.g. audio or visual). Here, we investigate if agents can evolve a shared
language when they are equipped with a continuous sensory-motor system to
produce and perceive signs, e.g. drawings. To this end, we introduce the
Graphical Referential Game (GREG) where a speaker must produce a graphical
utterance to name a visual referent object consisting of combinations of MNIST
digits while a listener has to select the corresponding object among distractor
referents, given the produced message. The utterances are drawing images
produced using dynamical motor primitives combined with a sketching library. To
tackle GREG we present CURVES: a multimodal contrastive deep learning mechanism
that represents the energy (alignment) between named referents and utterances
generated through gradient ascent on the learned energy landscape. We then
present a set of experiments and metrics based on a systematic compositional
dataset to evaluate the resulting language. We show that our method allows the
emergence of a shared, graphical language with compositional properties.
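For intuition, here is a minimal sketch of the kind of contrastive energy mechanism described above: two encoders score the alignment (energy) between referents and utterances, a contrastive loss trains that energy, and the speaker produces an utterance by gradient ascent on it. The CNN encoders, cosine-similarity energy, InfoNCE-style loss with a fixed temperature, image sizes, and the use of a raw pixel canvas in place of the paper's dynamical motor primitives and sketching library are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a CURVES-style contrastive energy model.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_encoder(out_dim: int = 64) -> nn.Module:
    """Small CNN mapping a 1x56x56 image (referent or drawing) to an embedding."""
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )


referent_enc = make_encoder()   # encodes the visual referent
utterance_enc = make_encoder()  # encodes the graphical utterance


def energy(referents: torch.Tensor, utterances: torch.Tensor) -> torch.Tensor:
    """Alignment (cosine similarity) between batches of referents and utterances."""
    r = F.normalize(referent_enc(referents), dim=-1)
    u = F.normalize(utterance_enc(utterances), dim=-1)
    return r @ u.t()  # [batch, batch] matrix of pairwise energies


def contrastive_loss(referents: torch.Tensor, utterances: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE-style loss: matching pairs sit on the diagonal."""
    logits = energy(referents, utterances) / 0.1  # temperature is an assumption
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def speak(referent: torch.Tensor, steps: int = 200, lr: float = 0.1) -> torch.Tensor:
    """Produce an utterance by gradient ascent on the learned energy,
    optimising a raw pixel canvas purely for illustration."""
    canvas = torch.zeros(1, 1, 56, 56, requires_grad=True)
    opt = torch.optim.Adam([canvas], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -energy(referent, canvas).squeeze()  # ascend the energy
        loss.backward()
        opt.step()
        canvas.data.clamp_(0.0, 1.0)  # keep the drawing in a valid pixel range
    return canvas.detach()
```

In practice the two encoders would first be trained with contrastive_loss on paired (referent, utterance) data before speak is used to generate new drawings.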
Related papers
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data.
arXiv Detail & Related papers (2024-09-23T20:19:24Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Know your audience: specializing grounded language models with listener subtraction [20.857795779760917]
We take inspiration from Dixit to formulate a multi-agent image reference game.
We show that finetuning an attention-based adapter between a CLIP vision encoder and a large language model in this contrastive, multi-agent setting gives rise to context-dependent natural language specialization.
arXiv Detail & Related papers (2022-06-16T17:52:08Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech [26.076534338576234]
Learning to understand grounded language, which connects natural language to percepts, is a critical research area.
In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs.
arXiv Detail & Related papers (2021-12-27T16:12:30Z)
- Emergent Graphical Conventions in a Visual Communication Game [80.79297387339614]
Humans communicate with graphical sketches in addition to symbolic languages.
We take the very first step to model and simulate such an evolution process via two neural agents playing a visual communication game.
We devise a novel reinforcement learning method such that agents are evolved jointly towards successful communication and abstract graphical conventions.
arXiv Detail & Related papers (2021-11-28T18:59:57Z)
- Passing a Non-verbal Turing Test: Evaluating Gesture Animations Generated from Speech [6.445605125467574]
In this paper, we propose a novel, data-driven technique for generating gestures directly from speech.
Our approach is based on the application of Generative Adversarial Neural Networks (GANs) to model the correlation rather than causation between speech and gestures.
For the study, we animate the generated gestures on a virtual character. We find that users are not able to distinguish between the generated and the recorded gestures.
arXiv Detail & Related papers (2021-07-01T19:38:43Z)
- Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of the relationships we are able to improve over state-of-the-art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
- Towards Graph Representation Learning in Emergent Communication [37.8523331078468]
We use graph convolutional networks to support the evolution of language and cooperation in multi-agent systems.
Motivated by an image-based referential game, we propose a graph referential game with varying degrees of complexity.
We show that the emerged communication protocol is robust, that the agents uncover the true factors of variation in the game, and that they learn to generalize beyond the samples encountered during training.
arXiv Detail & Related papers (2020-01-24T15:55:59Z)
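As a rough illustration of the graph referential game described in the entry above, the sketch below wires a GCN-based speaker and listener into a toy round: the speaker embeds the referent graph into a message vector and the listener scores candidate graphs against it. The hand-rolled graph convolution, two-layer encoder with mean pooling, continuous message vector, dot-product scoring, and random toy graphs are assumptions for exposition; the paper's actual protocol and training procedure are not reproduced here.

```python
# Minimal sketch (assumptions, not the paper's code) of a graph referential game.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """One GCN layer: H' = ReLU(A_norm H W), with A_norm = D^-1/2 (A + I) D^-1/2."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, adj, feats):
        a = adj + torch.eye(adj.size(0))                        # add self-loops
        d = a.sum(1).clamp(min=1e-6).pow(-0.5)                  # D^-1/2
        return F.relu(self.lin((d[:, None] * a * d[None, :]) @ feats))


class GraphEncoder(nn.Module):
    """Two GCN layers plus mean pooling -> fixed-size graph embedding."""
    def __init__(self, feat_dim, hid=32):
        super().__init__()
        self.g1, self.g2 = GraphConv(feat_dim, hid), GraphConv(hid, hid)

    def forward(self, adj, feats):
        return self.g2(adj, self.g1(adj, feats)).mean(dim=0)


speaker, listener = GraphEncoder(feat_dim=8), GraphEncoder(feat_dim=8)


def play_round(referent, candidates):
    """Speaker describes the referent graph; listener picks among candidates."""
    message = speaker(*referent)                                # message vector
    scores = torch.stack([listener(a, x) @ message for a, x in candidates])
    return scores.argmax().item()                               # listener's guess


# Toy usage: three random 5-node graphs, the first one is the referent.
graphs = [((torch.rand(5, 5) > 0.5).float(), torch.rand(5, 8)) for _ in range(3)]
guess = play_round(graphs[0], graphs)
```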
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.