Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
- URL: http://arxiv.org/abs/2203.08075v1
- Date: Tue, 15 Mar 2022 17:02:30 GMT
- Title: Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
- Authors: Xiao Liu, Da Yin, Yansong Feng, Dongyan Zhao
- Abstract summary: We investigate whether models with visual signals learn more spatial commonsense than text-based models.
We propose a benchmark that focuses on the relative scales of objects, and the positional relationship between people and objects under different actions.
We find that image synthesis models are more capable of learning accurate and consistent spatial knowledge than other models.
- Score: 77.46233234061758
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spatial commonsense, the knowledge about spatial position and relationship
between objects (like the relative size of a lion and a girl, and the position
of a boy relative to a bicycle when cycling), is an important part of
commonsense knowledge. Although pretrained language models (PLMs) succeed in
many NLP tasks, they are shown to be ineffective in spatial commonsense
reasoning. Starting from the observation that images are more likely to exhibit
spatial commonsense than texts, we explore whether models with visual signals
learn more spatial commonsense than text-based PLMs. We propose a spatial
commonsense benchmark that focuses on the relative scales of objects, and the
positional relationship between people and objects under different actions. We
probe PLMs and models with visual signals, including vision-language pretrained
models and image synthesis models, on this benchmark, and find that image
synthesis models are more capable of learning accurate and consistent spatial
knowledge than other models. The spatial knowledge from image synthesis models
also helps in natural language understanding tasks that require spatial
commonsense.
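For a concrete sense of how a text-only PLM can be probed for relative scale, the sketch below queries a masked language model with a size-comparison prompt. It assumes the Hugging Face transformers library; the prompt wording and the bert-base-uncased checkpoint are illustrative choices, not the paper's exact probing setup.

```python
# Minimal sketch of a text-side probe for relative object scale.
# Assumes the Hugging Face `transformers` library; prompts and model
# choice are illustrative, not the paper's exact setup.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

prompts = [
    "A lion is [MASK] than a girl.",
    "A bicycle is [MASK] than a house.",
]
for prompt in prompts:
    # Restrict candidates to the two size comparatives of interest.
    preds = fill(prompt, targets=["larger", "smaller"])
    best = max(preds, key=lambda p: p["score"])
    print(prompt, "->", best["token_str"], round(best["score"], 3))
```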
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
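As a rough illustration of steering a text-to-image model with a learned embedding (in the spirit of the relation prompts described above, not the paper's released artifacts), a textual-inversion token could be loaded into a diffusion pipeline. This sketch assumes the diffusers library; the checkpoint, the relation_riding.bin file, and the <riding> token are hypothetical placeholders.

```python
# Sketch of conditioning a text-to-image diffusion model on a learned token.
# Assumes the `diffusers` library and a GPU; the checkpoint, the embedding
# file, and the <riding> token are hypothetical placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A textual-inversion embedding learned elsewhere is registered as a new token.
pipe.load_textual_inversion("relation_riding.bin", token="<riding>")

image = pipe("a person <riding> a bicycle in a park").images[0]
image.save("riding.png")
```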
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Does Spatial Cognition Emerge in Frontier Models? [56.47912101304053]
We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models.
Results suggest that contemporary frontier models fall short of the spatial intelligence of animals.
arXiv Detail & Related papers (2024-10-09T01:41:49Z)
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
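For context, the coarse image-text matching behaviour the paper critiques can be sketched by ranking candidate spatial clauses with a CLIP-style model. This assumes the transformers library; the scene.jpg path and the clause wording are illustrative, not the paper's evaluation protocol.

```python
# Sketch of ranking candidate spatial clauses for an image with a CLIP-style
# matching model -- the kind of coarse image-text scoring the paper argues is
# insufficient for fine-grained spatial relations. Assumes `transformers`;
# the image path is an illustrative placeholder.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical input image
clauses = [
    "the boy is on the bicycle",
    "the boy is under the bicycle",
    "the boy is behind the bicycle",
]
inputs = processor(text=clauses, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image[0]  # one similarity score per clause
for clause, score in sorted(zip(clauses, logits.tolist()), key=lambda x: -x[1]):
    print(f"{score:6.2f}  {clause}")
```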
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
- Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
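As a rough sketch of this kind of metric (not VISOR's exact definition), a text-specified 2D relation can be checked against the centroids of detected bounding boxes:

```python
# Rough sketch of checking a text-specified 2D spatial relation against
# detected bounding boxes (x_min, y_min, x_max, y_max) in image coordinates,
# where y grows downward. Illustrative only; not VISOR's exact definition.
def centroid(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def relation_holds(box_a, box_b, relation):
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    checks = {
        "left of": ax < bx,
        "right of": ax > bx,
        "above": ay < by,
        "below": ay > by,
    }
    return checks[relation]

# e.g. "a dog to the left of a chair"
print(relation_holds((10, 40, 60, 90), (120, 30, 200, 110), "left of"))  # True
```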
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
- Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
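A minimal sketch of the symmetric cross-modal contrastive objective such two-stream alignment typically relies on, assuming PyTorch; this is the generic InfoNCE form, not necessarily the paper's exact loss:

```python
# Generic symmetric contrastive (InfoNCE) loss over a batch of paired visual
# and language embeddings, as used for cross-modal alignment. Assumes PyTorch;
# a sketch of the standard form, not the paper's exact objective.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))        # matching pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```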
arXiv Detail & Related papers (2021-11-13T19:54:15Z)
- Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
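A minimal sketch of the late-fusion idea, assuming PyTorch; the encoders are abstracted away, and the feature dimensions, class count, and averaging scheme are illustrative placeholders rather than the paper's architecture:

```python
# Minimal sketch of late fusion: spatial and temporal pose features are
# encoded separately and their class predictions are combined at the end.
# Assumes PyTorch; dimensions and heads are illustrative placeholders.
import torch
import torch.nn as nn

class LateFusionWSLR(nn.Module):
    def __init__(self, spatial_dim=256, temporal_dim=256, num_classes=100):
        super().__init__()
        self.spatial_head = nn.Linear(spatial_dim, num_classes)
        self.temporal_head = nn.Linear(temporal_dim, num_classes)

    def forward(self, spatial_feat, temporal_feat):
        # Average the per-stream class logits (late fusion).
        return 0.5 * (self.spatial_head(spatial_feat) +
                      self.temporal_head(temporal_feat))

logits = LateFusionWSLR()(torch.randn(4, 256), torch.randn(4, 256))  # (4, 100)
```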
arXiv Detail & Related papers (2020-12-01T19:10:50Z)
- Understanding Spatial Relations through Multiple Modalities [78.07328342973611]
Spatial relations between objects can be either explicit, expressed as spatial prepositions, or implicit, expressed by spatial verbs such as moving, walking, and shifting.
We introduce the task of inferring implicit and explicit spatial relations between two entities in an image.
We design a model that uses both textual and visual information to predict the spatial relations, making use of both positional and size information of objects and image embeddings.
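As an illustration of the positional and size cues such a model can draw on (not the paper's exact feature set), simple pairwise features can be computed from two detected boxes:

```python
# Sketch of simple positional and size features for a pair of object boxes
# (x_min, y_min, x_max, y_max), normalized by image size. Illustrative only;
# not the exact features used in the paper.
def pair_features(box_a, box_b, img_w, img_h):
    def norm(box):
        x0, y0, x1, y1 = box
        return (x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h)

    ax0, ay0, ax1, ay1 = norm(box_a)
    bx0, by0, bx1, by1 = norm(box_b)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    # Centroid offset and relative size, typical cues for spatial relations.
    dx = (ax0 + ax1) / 2 - (bx0 + bx1) / 2
    dy = (ay0 + ay1) / 2 - (by0 + by1) / 2
    return [dx, dy, area_a, area_b, area_a / max(area_b, 1e-6)]

print(pair_features((10, 20, 50, 80), (60, 30, 120, 90), 200, 100))
```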
arXiv Detail & Related papers (2020-07-19T01:35:08Z)
- Learning Object Placements For Relational Instructions by Hallucinating Scene Representations [26.897316325189205]
We present a convolutional neural network for estimating pixelwise object placement probabilities for a set of spatial relations from a single input image.
Our method does not require ground truth data for the pixelwise relational probabilities or 3D models of the objects.
Results obtained using real-world data and human-robot experiments demonstrate the effectiveness of our method.
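A minimal fully-convolutional sketch of mapping an image to per-relation pixelwise placement probabilities, assuming PyTorch; the layer choices are illustrative, not the paper's architecture:

```python
# Minimal fully-convolutional sketch that maps an RGB image to one pixelwise
# placement-probability map per spatial relation. Assumes PyTorch; the layers
# are illustrative placeholders, not the paper's architecture.
import torch
import torch.nn as nn

class PlacementNet(nn.Module):
    def __init__(self, num_relations=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_relations, 1),
        )

    def forward(self, image):
        # Sigmoid gives an independent placement probability per pixel
        # and per relation.
        return torch.sigmoid(self.net(image))

probs = PlacementNet()(torch.randn(1, 3, 128, 128))  # (1, 4, 128, 128)
```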
arXiv Detail & Related papers (2020-01-23T12:58:50Z)