Grounding Language in Multi-Perspective Referential Communication
- URL: http://arxiv.org/abs/2410.03959v1
- Date: Fri, 4 Oct 2024 22:42:30 GMT
- Title: Grounding Language in Multi-Perspective Referential Communication
- Authors: Zineng Tang, Lingjun Mao, Alane Suhr
- Abstract summary: We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments.
We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments.
We evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents.
- Score: 16.421832484760987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may differ from their own, to both produce and understand references to objects in the scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model on evidence of communicative success when paired with a listener, improving communicative success from 58.9% to 69.3% and even outperforming the strongest proprietary model.
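As a rough illustration of that final experiment, here is a minimal policy-gradient sketch of fine-tuning a speaker on communicative success. The interfaces (`sample_with_log_prob`, `listener_pick`) are hypothetical names for this sketch, not the paper's actual API.

```python
import torch

def communicative_success_step(speaker, listener_pick, scene, target_obj, optimizer):
    """One hypothetical REINFORCE update rewarding communicative success.

    `speaker.sample_with_log_prob` and `listener_pick` are assumed
    interfaces; the paper's training setup may differ in detail.
    """
    # Speaker samples a referring expression for the target object from
    # its own viewpoint, tracking the sample's log-probability.
    expression, log_prob = speaker.sample_with_log_prob(scene, target_obj)

    # Communicative success: the listener, viewing the scene from its own
    # (possibly different) perspective, tries to resolve the reference.
    reward = 1.0 if listener_pick(scene, expression) == target_obj else 0.0

    # REINFORCE: raise the probability of expressions the listener
    # resolved correctly; failed expressions get no reinforcement.
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```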
Related papers
- SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words [45.2706444740307]
We present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation.
We implement three different models and construct a training set following a process similar to that of SD-Eval.
The training set contains 1,052.72 hours of speech data and 724.4k utterances.
arXiv Detail & Related papers (2024-06-19T08:46:29Z)
- DevBench: A multimodal developmental benchmark for language learning [0.34129029452670606]
We introduce DevBench, a benchmark for evaluating vision-language models against human behavioral data on a suite of language tasks.
We show that DevBench supports comparing models' learning to human language development.
These comparisons highlight ways in which model and human language learning processes diverge.
arXiv Detail & Related papers (2024-06-14T17:49:41Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- Dial2vec: Self-Guided Contrastive Learning of Unsupervised Dialogue Embeddings [41.79937481022846]
We introduce the task of learning unsupervised dialogue embeddings.
Trivial approaches, such as combining pre-trained word or sentence embeddings or encoding dialogues with pre-trained language models, have been shown to be feasible.
We propose a self-guided contrastive learning approach named dial2vec.
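The abstract does not detail dial2vec's self-guidance mechanism, so the following is only a generic contrastive (InfoNCE) objective over dialogue embeddings, as a minimal sketch of the broader technique:

```python
import torch
import torch.nn.functional as F

def contrastive_dialogue_loss(anchor, positive, temperature=0.07):
    """Generic InfoNCE loss over a batch of dialogue embeddings.

    anchor, positive: (batch, dim) tensors; row i of `positive` is an
    embedding of another view of dialogue i, and all other rows in the
    batch serve as negatives. This is a plain contrastive objective,
    not dial2vec's specific self-guided formulation.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    # Cosine similarity between every anchor and every candidate.
    logits = anchor @ positive.t() / temperature
    # The matching (diagonal) pair is the correct "class" for each anchor.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```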
arXiv Detail & Related papers (2022-10-27T11:14:06Z)
- Intra-agent speech permits zero-shot task acquisition [13.19051572784014]
We take inspiration from processes of "inner speech" in humans to better understand the role of intra-agent speech in embodied behavior.
We develop algorithms that enable visually grounded captioning with little labeled language data.
We incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world.
arXiv Detail & Related papers (2022-06-07T09:28:10Z)
- Reference-Centric Models for Grounded Collaborative Dialogue [42.48421111626639]
We present a grounded neural dialogue model that successfully collaborates with people in a partially-observable reference game.
We focus on a setting where two agents each observe an overlapping part of a world context and need to identify and agree on some object they share.
Our dialogue agent accurately grounds referents from the partner's utterances using a structured reference resolver, conditions on these referents using a recurrent memory, and uses a pragmatic generation procedure to ensure the partner can resolve the references the agent produces.
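As an illustration of what a pragmatic generation procedure can look like, here is a minimal RSA-style reranking sketch; `base_speaker` and `sim_listener` are hypothetical interfaces, and the paper's actual procedure is more elaborate.

```python
def pragmatic_generate(base_speaker, sim_listener, context, target, k=10):
    """Rerank k candidate references by a simulated listener's accuracy.

    `base_speaker.sample` and `sim_listener.prob` are assumed interfaces.
    """
    candidates = [base_speaker.sample(context, target) for _ in range(k)]
    # Choose the candidate that the simulated listener most reliably
    # resolves to the intended referent.
    return max(candidates, key=lambda u: sim_listener.prob(target, u, context))
```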
arXiv Detail & Related papers (2021-09-10T18:03:54Z)
- Probing Task-Oriented Dialogue Representation from Language Models [106.02947285212132]
This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks.
We fine-tune a feed-forward layer as the classifier probe on top of a fixed pre-trained language model with annotated labels in a supervised way.
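This is the standard linear-probing recipe: train a small classifier on top of a frozen encoder. A minimal sketch using Hugging Face Transformers, where the checkpoint and label count are placeholders rather than the paper's exact setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.requires_grad_(False)  # keep the pre-trained LM fixed

num_labels = 10  # placeholder: number of annotated dialogue labels
probe = torch.nn.Linear(encoder.config.hidden_size, num_labels)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(texts, labels):
    # Encode utterances with the frozen LM; only the probe is trained.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state[:, 0]  # [CLS] states
    logits = probe(hidden)
    loss = torch.nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: probe_step(["book a table for two"], torch.tensor([3]))
```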
arXiv Detail & Related papers (2020-10-26T21:34:39Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z) - Emergent Communication with World Models [80.55287578801008]
We introduce Language World Models, a class of language-conditional generative models that interpret natural language messages.
We incorporate this "observation" into a persistent memory state, and allow the listening agent's policy to condition on it.
We show this improves effective communication and task success in 2D gridworld speaker-listener navigation tasks.
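A rough sketch of the listener side as described, with illustrative module choices and shapes; the paper's generative world model is more expressive than the linear stand-in here.

```python
import torch
import torch.nn as nn

class ListenerPolicy(nn.Module):
    """Illustrative listener: a world model maps the message to a
    predicted observation, which is folded into a persistent memory
    state that the action policy conditions on (all modules assumed)."""

    def __init__(self, obs_dim, msg_dim, mem_dim, n_actions):
        super().__init__()
        self.world_model = nn.Linear(msg_dim, obs_dim)  # message -> predicted observation
        self.memory = nn.GRUCell(obs_dim * 2, mem_dim)  # persistent memory update
        self.policy = nn.Linear(mem_dim, n_actions)     # action logits from memory

    def forward(self, obs, msg, mem):
        predicted_obs = self.world_model(msg)
        # Fold the real observation and the message-derived "observation"
        # into the persistent memory state the policy reads.
        mem = self.memory(torch.cat([obs, predicted_obs], dim=-1), mem)
        return self.policy(mem), mem
```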
arXiv Detail & Related papers (2020-02-22T02:34:51Z)
- On the interaction between supervision and self-play in emergent communication [82.290338507106]
We investigate the relationship between two categories of learning signals, supervision on human data and self-play, with the ultimate goal of improving sample efficiency.
We find that first training agents via supervised learning on human data followed by self-play outperforms the converse.
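A schematic of that training order, with hypothetical `supervised_update`/`reinforce` helpers standing in for the paper's actual algorithms:

```python
def train_agents(speaker, listener, human_data, env, sup_steps, sp_steps):
    # Stage 1: ground the protocol in human conventions via supervision.
    for batch in human_data.iterate(sup_steps):
        speaker.supervised_update(batch)
        listener.supervised_update(batch)
    # Stage 2: refine with self-play, optimizing task reward directly.
    # (The paper finds this ordering beats self-play followed by supervision.)
    for _ in range(sp_steps):
        episode = env.rollout(speaker, listener)
        speaker.reinforce(episode.reward)
        listener.reinforce(episode.reward)
```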
arXiv Detail & Related papers (2020-02-04T02:35:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.