Find Someone Who: Visual Commonsense Understanding in Human-Centric
Grounding
- URL: http://arxiv.org/abs/2212.06971v1
- Date: Wed, 14 Dec 2022 01:37:16 GMT
- Title: Find Someone Who: Visual Commonsense Understanding in Human-Centric
Grounding
- Authors: Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, Shih-Fu Chang
- Abstract summary: We present a new commonsense task, Human-centric Commonsense Grounding.
It tests the models' ability to ground individuals given the context descriptions about what happened before.
We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pretrained models.
- Score: 87.39245901710079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: From a visual scene containing multiple people, humans are able to
distinguish each individual given context descriptions about what happened
before, their mental/physical states, or their intentions. This ability relies
heavily on human-centric commonsense knowledge and reasoning. For example, if
asked to identify the "person who needs healing" in an image, we first need to
know that such a person usually shows injuries or a pained expression, then
find the corresponding visual clues, and finally ground the person. We present
a new commonsense task, Human-centric Commonsense Grounding, which tests
models' ability to ground individuals given context descriptions about what
happened before and their mental/physical states or intentions. We further
create a benchmark, HumanCog, a dataset with 130k grounded commonsensical
descriptions annotated on 67k images, covering diverse types of commonsense and
visual scenes. We set up a context-object-aware method as a strong baseline
that outperforms previous pre-trained and non-pretrained models. Further
analysis demonstrates that rich visual commonsense and powerful integration of
multi-modal commonsense are essential, which sheds light on future work. Data
and code will be available at https://github.com/Hxyou/HumanCog.
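To make the task setup more concrete, below is a minimal sketch of what a HumanCog-style grounding example and its evaluation could look like. The record fields, the accuracy metric, and all names are illustrative assumptions for explanation only, not the released data format or the paper's baseline method.

```python
# Illustrative sketch only: field names and structure are assumptions,
# not the official HumanCog data format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundingExample:
    image_path: str                                 # scene containing multiple people
    description: str                                # commonsense context, e.g. "the person who needs healing"
    person_boxes: List[Tuple[int, int, int, int]]   # candidate person boxes (x1, y1, x2, y2)
    target_index: int                               # index of the person the description refers to

def grounding_accuracy(examples: List[GroundingExample],
                       predictions: List[int]) -> float:
    """Fraction of examples where the predicted person matches the annotated one."""
    correct = sum(int(pred == ex.target_index)
                  for ex, pred in zip(examples, predictions))
    return correct / max(len(examples), 1)

# Example usage with a dummy prediction:
example = GroundingExample(
    image_path="scene_001.jpg",
    description="the person who needs healing",
    person_boxes=[(10, 20, 110, 220), (150, 30, 260, 240)],
    target_index=1,
)
print(grounding_accuracy([example], [1]))  # 1.0
```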
Related papers
- CapHuman: Capture Your Moments in Parallel Universes [60.06408546134581]
We present a new framework named CapHuman.
CapHuman encodes identity features and then learns to align them into the latent space.
We introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner.
arXiv Detail & Related papers (2024-02-01T14:41:59Z)
- A natural language processing-based approach: mapping human perception by understanding deep semantic features in street view images [2.5880672192855414]
We propose a new framework based on a pre-trained natural language model to understand the relationship between human perception and a scene.
Our results show that human perception scoring with deep semantic features outperforms previous machine learning methods that rely on shallow features.
arXiv Detail & Related papers (2023-11-29T05:00:43Z)
- HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining [75.1086193340286]
It is desirable to have a general pretrained model for versatile human-centric downstream tasks.
We propose HumanBench, built on existing datasets, to evaluate the generalization abilities of different pretraining methods on a common ground.
Our PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets.
arXiv Detail & Related papers (2023-03-10T02:57:07Z)
- MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain [23.598727613908853]
We present MECCANO, a dataset of egocentric videos for studying human behavior understanding in industrial-like settings.
The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset.
The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first person view.
arXiv Detail & Related papers (2022-09-19T00:52:42Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning [78.13740873213223]
Bongard problems (BPs) were introduced as an inspirational challenge for visual cognition in intelligent systems.
We propose Bongard-LOGO, a new benchmark for human-level concept learning and reasoning.
arXiv Detail & Related papers (2020-10-02T03:19:46Z)
- VisualCOMET: Reasoning about the Dynamic Context of a Still Image [97.20800299330078]
We propose VisualComet, a framework for visual commonsense reasoning.
VisualComet predicts events that might have happened before, events that might happen next, and the intents of the people at present.
We introduce the first large-scale repository of Visual Commonsense Graphs.
arXiv Detail & Related papers (2020-04-22T19:02:20Z)