Know your audience: specializing grounded language models with listener
subtraction
- URL: http://arxiv.org/abs/2206.08349v2
- Date: Mon, 1 May 2023 20:39:20 GMT
- Title: Know your audience: specializing grounded language models with listener
subtraction
- Authors: Aaditya K. Singh, David Ding, Andrew Saxe, Felix Hill, Andrew K.
Lampinen
- Abstract summary: We take inspiration from Dixit to formulate a multi-agent image reference game.
We show that finetuning an attention-based adapter between a CLIP vision encoder and a large language model in this contrastive, multi-agent setting gives rise to context-dependent natural language specialization.
- Score: 20.857795779760917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective communication requires adapting to the idiosyncrasies of
each communicative context, such as the common ground shared with each partner.
Humans demonstrate this ability to specialize to their audience in many
contexts, such as the popular game Dixit. We take inspiration from Dixit to
formulate a multi-agent image reference game where a (trained) speaker model is
rewarded for describing a target image such that one (pretrained) listener
model can correctly identify it among distractors, but another listener cannot.
To adapt, the speaker must exploit differences in the knowledge it shares with
the different listeners. We show that finetuning an attention-based adapter
between a CLIP vision encoder and a large language model in this contrastive,
multi-agent setting gives rise to context-dependent natural language
specialization from rewards only, without direct supervision. Through
controlled experiments, we show that our method allows a speaker trained with
two listeners that perceive differently to adapt to the idiosyncrasies of those
listeners. Furthermore, we show zero-shot transfer of the
specialization to real-world data. Our experiments demonstrate a method for
specializing grounded language models without direct supervision and highlight
the interesting research challenges posed by complex multi-agent communication.
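
The abstract describes the objective only at a high level, so here is a minimal
sketch of one plausible form of a "listener subtraction" reward. The CLIP-style
`encode_text`/`encode_image` interface, the softmax over image-caption
similarities, and the `alpha` weight are illustrative assumptions, not the
paper's implementation; in the paper, a reward of this contrastive form is used
to finetune the speaker's adapter.

```python
import torch

def listener_logprob(listener, caption: str, images: torch.Tensor,
                     target_idx: int) -> torch.Tensor:
    """Log-probability that a CLIP-style listener resolves the caption to the target.

    `listener` is a hypothetical object exposing `encode_text` / `encode_image`;
    the caption is scored against all candidate images with a softmax.
    """
    text = listener.encode_text(caption)    # (d,) caption embedding
    imgs = listener.encode_image(images)    # (n, d) candidate image embeddings
    logits = imgs @ text                    # (n,) similarity of caption to each image
    return torch.log_softmax(logits, dim=0)[target_idx]

def listener_subtraction_reward(caption: str, images: torch.Tensor, target_idx: int,
                                listener_a, listener_b,
                                alpha: float = 1.0) -> torch.Tensor:
    """Contrastive reward: high when listener A resolves the reference but listener B does not."""
    return (listener_logprob(listener_a, caption, images, target_idx)
            - alpha * listener_logprob(listener_b, caption, images, target_idx))
```

Maximizing such a reward pushes the speaker toward descriptions grounded in
features the first listener perceives but the second does not, which is the
context-dependent specialization the abstract reports.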
Related papers
- Speaking the Language of Your Listener: Audience-Aware Adaptation via
Plug-and-Play Theory of Mind [4.052000839878213]
We model a visually grounded referential game between a knowledgeable speaker and a listener with more limited visual and linguistic experience.
We endow our speaker with the ability to adapt its referring expressions via a simulation module that monitors the effectiveness of planned utterances from the listener's perspective (a generic sketch of this simulate-and-rerank idea appears after this list).
arXiv Detail & Related papers (2023-05-31T15:17:28Z)
- Communication Drives the Emergence of Language Universals in Neural
Agents: Evidence from the Word-order/Case-marking Trade-off [3.631024220680066]
We propose a new Neural-agent Language Learning and Communication framework (NeLLCom) where pairs of speaking and listening agents first learn a miniature language.
We succeed in replicating the trade-off with the new framework without hard-coding specific biases in the agents.
arXiv Detail & Related papers (2023-01-30T17:22:33Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances, going beyond the sequential contextualization offered by PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially outperforms strong PrLM baselines on four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Intra-agent speech permits zero-shot task acquisition [13.19051572784014]
We take inspiration from processes of "inner speech" in humans to better understand the role of intra-agent speech in embodied behavior.
We develop algorithms that enable visually grounded captioning with little labeled language data.
We incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world.
arXiv Detail & Related papers (2022-06-07T09:28:10Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Few-shot Language Coordination by Modeling Theory of Mind [95.54446989205117]
We study the task of few-shot language coordination.
We require the lead agent to coordinate with a population of agents with different linguistic abilities.
This requires the ability to model the partner's beliefs, a vital component of human communication.
arXiv Detail & Related papers (2021-07-12T19:26:11Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training (a toy sketch of these ideas appears after this list).
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Self-play for Data Efficient Language Acquisition [20.86261546611472]
We exploit the symmetric nature of communication in order to improve the efficiency and quality of language acquisition in learning agents.
We show that using self-play as a substitute for direct supervision enables the agent to transfer its knowledge across roles.
arXiv Detail & Related papers (2020-10-10T02:09:19Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
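
As referenced in the "Speaking the Language of Your Listener" entry above, a
speaker can adapt by simulating its listener. A common way to realize this is
pragmatic reranking: generate candidate descriptions, score each with a
simulated listener, and keep the one the listener resolves best. The sketch
below is a generic illustration of that pattern under the same hypothetical
CLIP-style listener interface as above, not the paper's plug-and-play module.

```python
import torch

def adapt_utterance(candidates: list[str], listener, images: torch.Tensor,
                    target_idx: int) -> str:
    """Pragmatic reranking: return the candidate description that a simulated
    listener is most likely to resolve to the target image."""
    def resolve_prob(caption: str) -> float:
        text = listener.encode_text(caption)           # (d,) caption embedding
        logits = listener.encode_image(images) @ text  # (n,) image-caption similarities
        return torch.softmax(logits, dim=0)[target_idx].item()
    return max(candidates, key=resolve_prob)
```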
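
The VQMIVC entry names two concrete ingredients, vector quantization for
content encoding and a mutual-information-based correlation penalty, without
detail. The sketch below illustrates both under stated assumptions: the
straight-through quantization follows the standard VQ-VAE recipe, and the
squared cross-correlation penalty is only a crude stand-in for a proper MI
estimator such as the one used in the paper.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbor quantization with a straight-through gradient (VQ-VAE style).

    z: (batch, d) content embeddings; codebook: (k, d) learned code vectors.
    """
    codes = torch.cdist(z, codebook).argmin(dim=1)   # nearest code index per embedding
    z_q = codebook[codes]                            # (batch, d) quantized embeddings
    codebook_loss = F.mse_loss(z_q, z.detach())      # pull codes toward encoder outputs
    commit_loss = F.mse_loss(z, z_q.detach())        # pull encoder outputs toward codes
    z_q = z + (z_q - z).detach()                     # straight-through estimator
    return z_q, codebook_loss + 0.25 * commit_loss

def decorrelation_penalty(content: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
    """Crude MI proxy: mean squared cross-correlation between standardized
    content and speaker embeddings (zero when the two are uncorrelated)."""
    c = (content - content.mean(0)) / (content.std(0) + 1e-6)
    s = (speaker - speaker.mean(0)) / (speaker.std(0) + 1e-6)
    return (((c.T @ s) / c.shape[0]) ** 2).mean()
```

In a full system this penalty would be added to the reconstruction and VQ
losses so that the quantized content codes carry as little speaker information
as possible.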