BOSS: A Benchmark for Human Belief Prediction in Object-context
Scenarios
- URL: http://arxiv.org/abs/2206.10665v1
- Date: Tue, 21 Jun 2022 18:29:17 GMT
- Title: BOSS: A Benchmark for Human Belief Prediction in Object-context
Scenarios
- Authors: Jiafei Duan, Samson Yu, Nicholas Tan, Li Yi, Cheston Tan
- Abstract summary: This paper uses the combined knowledge of Theory of Mind (ToM) and Object-Context Relations to investigate methods for enhancing collaboration between humans and autonomous systems.
We propose a novel and challenging multimodal video dataset for assessing the capability of artificial intelligence (AI) systems in predicting human belief states in an object-context scenario.
- Score: 14.23697277904244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans with an average level of social cognition can infer the beliefs of
others based solely on the nonverbal communication signals (e.g. gaze, gesture,
pose and contextual information) exhibited during social interactions. This
social cognitive ability to predict human beliefs and intentions is more
important than ever for ensuring safe human-robot interaction and
collaboration. This paper uses the combined knowledge of Theory of Mind (ToM)
and Object-Context Relations to investigate methods for enhancing collaboration
between humans and autonomous systems in environments where verbal
communication is prohibited. We propose a novel and challenging multimodal
video dataset for assessing the capability of artificial intelligence (AI)
systems in predicting human belief states in an object-context scenario. The
proposed dataset provides precisely labelled ground-truth human belief states
together with multimodal inputs that replicate the nonverbal communication
cues available to human perception. We further evaluate our dataset with
existing deep learning models and provide new insights into the effects of the
various input modalities and object-context relations on the performance of the
baseline models.
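To make the modality-ablation idea concrete, the sketch below shows one way such a baseline could be structured: a late-fusion classifier with one encoder per nonverbal modality, where dropping a modality from the input dictionary corresponds to an ablation. This is not the paper's model; the modality names, feature sizes, and number of belief states are assumptions chosen for illustration.

```python
# Illustrative late-fusion baseline (not the authors' model): each nonverbal
# modality is encoded separately, the embeddings are concatenated, and a small
# classifier predicts a discrete belief state. All dimensions, modality names,
# and the number of belief classes are assumptions for this sketch.
import torch
import torch.nn as nn


class LateFusionBeliefBaseline(nn.Module):
    def __init__(self, modality_dims, num_belief_states, hidden=128):
        super().__init__()
        # One small encoder per modality (e.g. gaze, pose, object context).
        self.encoders = nn.ModuleDict(
            {name: nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
             for name, dim in modality_dims.items()}
        )
        self.classifier = nn.Linear(hidden * len(modality_dims), num_belief_states)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (batch, feature_dim) tensor
        fused = torch.cat(
            [self.encoders[name](inputs[name]) for name in self.encoders], dim=-1
        )
        return self.classifier(fused)  # (batch, num_belief_states) logits


if __name__ == "__main__":
    # Hypothetical per-frame feature sizes; ablating a modality amounts to
    # removing its entry from both dictionaries below.
    dims = {"gaze": 64, "pose": 128, "object_context": 256}
    model = LateFusionBeliefBaseline(dims, num_belief_states=27)
    batch = {name: torch.randn(8, d) for name, d in dims.items()}
    print(model(batch).shape)  # torch.Size([8, 27])
```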
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation.
We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a "multimodal transcript" (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-09-13T18:28:12Z)
- A Multi-Modal Explainability Approach for Human-Aware Robots in Multi-Party Conversation [39.87346821309096]
We present an addressee estimation model with improved performance in comparison with the previous SOTA.
We also propose several ways to incorporate explainability and transparency in the aforementioned architecture.
arXiv Detail & Related papers (2024-05-20T13:09:32Z)
- Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task [17.190635800969456]
In this paper, we examine using Large Language Models to infer human intention in a collaborative object categorization task with a physical robot.
We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions.
arXiv Detail & Related papers (2024-04-12T12:15:14Z)
- Multi-Agent Dynamic Relational Reasoning for Social Robot Navigation [50.01551945190676]
Social robot navigation can be helpful in various contexts of daily life but requires safe human-robot interactions and efficient trajectory planning.
We propose a systematic relational reasoning approach with explicit inference of the underlying dynamically evolving relational structures.
We demonstrate its effectiveness for multi-agent trajectory prediction and social robot navigation.
arXiv Detail & Related papers (2024-01-22T18:58:22Z)
- Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, meetings composed of 3-4 persons, and microphones and cameras, respectively.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Human Trajectory Forecasting in Crowds: A Deep Learning Perspective [89.4600982169]
We present an in-depth analysis of existing deep learning-based methods for modelling social interactions.
We propose two knowledge-based data-driven methods to effectively capture these social interactions.
We develop a large scale interaction-centric benchmark TrajNet++, a significant yet missing component in the field of human trajectory forecasting.
arXiv Detail & Related papers (2020-07-07T17:19:56Z)
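As a concrete illustration of the "multimodal transcript" idea referenced in the engagement-prediction entry above, the sketch below renders timestamped annotations from several behavior modalities as interleaved text that could be placed inside an LLM prompt. The event schema, field names, and formatting are assumptions for illustration, not the cited paper's implementation.

```python
# Illustrative "multimodal transcript" construction (not the cited paper's
# implementation): timestamped annotations from several behavior modalities are
# rendered as interleaved text so a single LLM prompt can reason over them.
from dataclasses import dataclass


@dataclass
class Event:
    time_s: float
    modality: str   # e.g. "speech", "gaze", "facial_expression" (assumed labels)
    content: str    # e.g. transcribed words or a behavior annotation


def build_multimodal_transcript(events):
    """Merge per-modality events into one chronologically ordered transcript."""
    lines = [
        f"[{e.time_s:06.1f}s] {e.modality.upper()}: {e.content}"
        for e in sorted(events, key=lambda e: e.time_s)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    events = [
        Event(3.2, "speech", "So what did you think of the movie?"),
        Event(3.5, "gaze", "looking away from partner"),
        Event(4.1, "facial_expression", "neutral"),
        Event(5.0, "speech", "It was... fine, I guess."),
    ]
    # The resulting string would be embedded in an LLM prompt that asks for an
    # engagement rating; the prompt wording itself is omitted here.
    print(build_multimodal_transcript(events))
```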