CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual
Navigation in Noisy Environments
- URL: http://arxiv.org/abs/2306.04047v2
- Date: Wed, 27 Dec 2023 02:00:30 GMT
- Title: CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual
Navigation in Noisy Environments
- Authors: Xiulong Liu, Sudipta Paul, Moitreya Chatterjee, Anoop Cherian
- Abstract summary: CAVEN is a framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal.
Our results show that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate.
- Score: 41.21509045214965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual navigation of an agent towards locating an audio goal is a
challenging task, especially when the audio is sporadic or the environment is
noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual
Embodied Navigation framework in which the agent may interact with a
human/oracle for solving the task of navigating to an audio goal. Specifically,
CAVEN is modeled as a budget-aware partially observable semi-Markov decision
process that implicitly learns the uncertainty in the audio-based navigation
policy to decide when and how the agent may interact with the oracle. Our CAVEN
agent can engage in fully-bidirectional natural language conversations by
producing relevant questions and interpreting free-form, potentially noisy
responses from the oracle based on the audio-visual context. To enable such a
capability, CAVEN is equipped with: (i) a trajectory forecasting network that
is grounded in audio-visual cues to produce a potential trajectory to the
estimated goal, and (ii) a natural language based question generation and
reasoning network to pose an interactive question to the oracle or interpret
the oracle's response to produce navigation instructions. To train the
interactive modules, we present a large-scale dataset, AVN-Instruct, based on
the Landmark-RxR dataset. To substantiate the usefulness of conversations, we
present experiments on the benchmark audio-goal task using the SoundSpaces
simulator under various noisy settings. Our results show that our fully-conversational approach yields nearly an order-of-magnitude improvement in success rate over methods that use only unidirectional interaction, especially when localizing new sound sources.
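The abstract frames oracle interaction as a budget-aware decision driven by uncertainty in the audio-based navigation policy, supported by a trajectory forecasting network and a question generation/reasoning network. The paper's actual architecture is not given here, so the following Python sketch is only a minimal illustration of that idea: an entropy-gated, budget-limited choice of when to ask the oracle, with placeholder modules standing in for the forecaster and the conversational components. Every module name, dimension, and threshold below is a hypothetical stand-in, not CAVEN's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: entropy-gated, budget-aware oracle querying in the
# spirit of the CAVEN abstract. All module names, sizes, and thresholds are
# assumptions for illustration, not the paper's actual implementation.

class NavPolicy(nn.Module):
    """Placeholder audio-visual navigation policy: observation -> action logits."""
    def __init__(self, obs_dim: int = 128, num_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class TrajectoryForecaster(nn.Module):
    """Placeholder for a trajectory forecasting network grounded in audio-visual
    cues: observation -> a short sequence of predicted 2D waypoints."""
    def __init__(self, obs_dim: int = 128, horizon: int = 5):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Linear(obs_dim, horizon * 2)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).view(-1, self.horizon, 2)


def should_query_oracle(action_logits: torch.Tensor,
                        queries_left: int,
                        entropy_threshold: float = 1.2) -> bool:
    """Ask the oracle only when the policy is uncertain and query budget remains."""
    probs = F.softmax(action_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean().item()
    return queries_left > 0 and entropy > entropy_threshold


def navigation_step(obs, policy, forecaster, queries_left):
    """One decision step: act greedily, or spend one oracle query when uncertain."""
    logits = policy(obs)
    if should_query_oracle(logits, queries_left):
        # In CAVEN, a question-generation network would turn this forecast and
        # the visual context into a natural-language question, and a reasoning
        # network would map the oracle's free-form reply into navigation
        # instructions; both are stubbed out in this sketch.
        subgoal = forecaster(obs)[:, 0]          # first predicted waypoint
        return {"mode": "ask", "subgoal": subgoal}, queries_left - 1
    return {"mode": "act", "action": logits.argmax(dim=-1)}, queries_left


if __name__ == "__main__":
    policy, forecaster = NavPolicy(), TrajectoryForecaster()
    obs = torch.randn(1, 128)                    # dummy fused audio-visual feature
    decision, remaining = navigation_step(obs, policy, forecaster, queries_left=3)
    print(decision["mode"], remaining)
```

The gating rule is the simplest possible proxy for the "implicitly learned uncertainty" the abstract describes; the actual paper learns when and how to interact within a budget-aware partially observable semi-Markov decision process rather than using a fixed entropy threshold.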
Related papers
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192] (arXiv 2022-10-14)
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event by navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
- Pay Self-Attention to Audio-Visual Navigation [24.18976027602831] (arXiv 2022-10-04)
We propose an end-to-end framework to learn chasing after a moving audio target using a context-aware audio-visual fusion strategy.
Thorough experiments validate the superior performance of FSAAVN against the state of the art (see the fusion sketch after this list).
- Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments [21.493664174262737] (arXiv 2022-07-15)
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions.
- Audio-video fusion strategies for active speaker detection in meetings [5.61861182374067] (arXiv 2022-06-09)
We propose two types of fusion for the detection of the active speaker, combining two visual modalities and an audio modality through neural networks.
For our application context, adding motion information greatly improves performance.
We have shown that attention-based fusion improves performance while reducing the standard deviation.
- End-to-end Spoken Conversational Question Answering: Task, Dataset and Model [92.18621726802726] (arXiv 2022-04-29)
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions over audio recordings, and to explore the plausibility of providing additional cues from different modalities to aid information gathering.
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206] (arXiv 2022-04-18)
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
- End-to-End Active Speaker Detection [58.7097258722291] (arXiv 2022-03-27)
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
- Semantic Audio-Visual Navigation [93.12180578267186] (arXiv 2020-12-21)
We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning.
We propose a transformer-based model to tackle this new semantic AudioGoal task.
Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
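Two of the entries above (the FSAAVN paper and the active-speaker fusion paper) rely on attention-based fusion of audio and visual features, but the mechanism is not spelled out in these summaries. The sketch below shows one generic way such fusion is commonly implemented, using cross-modal attention from visual tokens to audio tokens; all dimensions and the overall layout are assumptions, and this is not the architecture of either paper.

```python
import torch
import torch.nn as nn

# Generic cross-modal attention fusion of audio and visual features.
# Purely illustrative: shapes and the fusion layout are assumptions, not the
# FSAAVN or active-speaker-detection architectures.

class AudioVisualFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        """visual: (B, Nv, D) patch/frame tokens; audio: (B, Na, D) spectrogram tokens."""
        # Visual tokens attend to audio tokens (queries from vision, keys/values from audio).
        attended, _ = self.attn(query=visual, key=audio, value=audio)
        fused_tokens = self.norm(visual + attended)          # residual + norm
        # Pool both streams and concatenate into a single fused embedding.
        fused = torch.cat([fused_tokens.mean(dim=1), audio.mean(dim=1)], dim=-1)
        return self.proj(fused)                              # (B, D)


if __name__ == "__main__":
    fusion = AudioVisualFusion()
    v = torch.randn(2, 49, 256)   # e.g., a 7x7 visual feature map, flattened
    a = torch.randn(2, 16, 256)   # e.g., 16 audio time steps
    print(fusion(v, a).shape)     # torch.Size([2, 256])
```

The fused embedding would feed a downstream policy or classifier head. Cross-modal attention is a common way to let a model re-weight a noisy modality; that is a generic design rationale, not a claim about either paper's specific method.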
This list is automatically generated from the titles and abstracts of the papers on this site.