Grounded Situation Recognition
- URL: http://arxiv.org/abs/2003.12058v1
- Date: Thu, 26 Mar 2020 17:57:52 GMT
- Title: Grounded Situation Recognition
- Authors: Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, Aniruddha Kembhavi
- Abstract summary: We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images.
GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities.
We show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic aware image retrieval.
- Score: 56.18102368133022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Grounded Situation Recognition (GSR), a task that requires
producing structured semantic summaries of images describing: the primary
activity, entities engaged in the activity with their roles (e.g. agent, tool),
and bounding-box groundings of entities. GSR presents important technical
challenges: identifying semantic saliency, categorizing and localizing a large
and diverse set of entities, overcoming semantic sparsity, and disambiguating
roles. Moreover, unlike in captioning, GSR is straightforward to evaluate. To
study this new task we create the Situations With Groundings (SWiG) dataset
which adds 278,336 bounding-box groundings to the 11,538 entity classes in the
imsitu dataset. We propose a Joint Situation Localizer and find that jointly
predicting situations and groundings with end-to-end training handily
outperforms independent training on the entire grounding metric suite with
relative gains between 8% and 32%. Finally, we show initial findings on three
exciting future directions enabled by our models: conditional querying, visual
chaining, and grounded semantic aware image retrieval. Code and data available
at https://prior.allenai.org/projects/gsr.
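To make the structured output concrete, the sketch below shows one plausible way to represent a single grounded situation frame (verb, role fillers, and per-role boxes) and to check a predicted role grounding against a reference box with an IoU threshold. The field names, the example verb and roles, and the 0.5 threshold are illustrative assumptions rather than the exact SWiG annotation schema or the paper's full grounding metric suite.

```python
# Illustrative sketch (not the official SWiG schema): a grounded situation
# frame pairs a verb with role -> entity assignments and per-role boxes.
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class GroundedRole:
    noun: str                  # entity class filling the role, e.g. "person"
    box: Optional[Box] = None  # None if the entity is not visible in the image

# Hypothetical frame for an image of a person feeding an animal.
frame = {
    "verb": "feeding",
    "roles": {
        "agent": GroundedRole("person", (12.0, 30.0, 180.0, 420.0)),
        "food":  GroundedRole("hay",    (150.0, 200.0, 260.0, 310.0)),
        "tool":  GroundedRole("bucket", (140.0, 260.0, 240.0, 380.0)),
        "place": GroundedRole("outdoors", None),  # role filled but not grounded
    },
}

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def role_grounded_correct(pred: GroundedRole, gold: GroundedRole,
                          thresh: float = 0.5) -> bool:
    """Count a role as correctly grounded only if the noun matches and, when
    the reference role is grounded, the predicted box has IoU >= thresh."""
    if pred.noun != gold.noun:
        return False
    if gold.box is None:
        return pred.box is None
    return pred.box is not None and iou(pred.box, gold.box) >= thresh
```

Under a representation like this, grounding-aware accuracy reduces to counting roles whose noun matches and, for grounded roles, whose predicted box also passes the IoU test, which is what makes the task straightforward to evaluate compared with captioning.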
Related papers
- S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving [12.406655155106424]
We propose S3PT, a novel scene semantics and structure guided clustering method that provides more scene-consistent objectives for self-supervised training.
Our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals.
Second, we introduce object diversity consistent spatial clustering, to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs.
Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level.
arXiv Detail & Related papers (2024-10-30T15:00:06Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- Frequency-based Matcher for Long-tailed Semantic Segmentation [22.199174076366003]
We focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS).
We propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions.
We also propose a transformer-based algorithm to improve LTSS, the frequency-based matcher, which solves the over-suppression problem via one-to-many matching.
arXiv Detail & Related papers (2024-06-06T09:57:56Z)
- Evaluating the Efficacy of Cut-and-Paste Data Augmentation in Semantic Segmentation for Satellite Imagery [4.499833362998487]
This study explores the effectiveness of a Cut-and-Paste augmentation technique for semantic segmentation in satellite images.
We adapt this augmentation, which usually requires labeled instances, to the case of semantic segmentation.
Using the DynamicEarthNet dataset and a U-Net model for evaluation, we find that this augmentation significantly improves the test-set mIoU from 37.9 to 44.1.
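For intuition, here is a minimal sketch of a cut-and-paste style augmentation adapted to dense semantic masks, in which all pixels of a chosen class are copied from a source tile into a target tile and the target label map is updated to match. The array shapes, the whole-class pasting policy, and the toy data are assumptions made for illustration, not the exact procedure evaluated in the paper.

```python
import numpy as np

def cut_and_paste(src_img: np.ndarray, src_mask: np.ndarray,
                  tgt_img: np.ndarray, tgt_mask: np.ndarray,
                  class_id: int):
    """Paste every pixel of `class_id` from the source sample onto the target.

    Images are HxWxC uint8 arrays and masks are HxW integer label maps of the
    same spatial size. Returns augmented copies of the target image and mask.
    """
    paste = src_mask == class_id           # boolean HxW selection
    out_img, out_mask = tgt_img.copy(), tgt_mask.copy()
    out_img[paste] = src_img[paste]        # overwrite pixels of the pasted class
    out_mask[paste] = class_id             # keep the label map consistent
    return out_img, out_mask

# Toy usage with random arrays standing in for two satellite tiles.
rng = np.random.default_rng(0)
src_img = rng.integers(0, 255, (64, 64, 3), dtype=np.uint8)
tgt_img = rng.integers(0, 255, (64, 64, 3), dtype=np.uint8)
src_mask = rng.integers(0, 5, (64, 64))    # 5 hypothetical semantic classes
tgt_mask = rng.integers(0, 5, (64, 64))
aug_img, aug_mask = cut_and_paste(src_img, src_mask, tgt_img, tgt_mask, class_id=3)
```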
arXiv Detail & Related papers (2024-04-08T17:18:30Z)
- Leveraging sparse and shared feature activations for disentangled representation learning [112.22699167017471]
We propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation.
We validate our approach on six real world distribution shift benchmarks, and different data modalities.
arXiv Detail & Related papers (2023-04-17T01:33:24Z)
- INoD: Injected Noise Discriminator for Self-Supervised Representation Learning in Agricultural Fields [6.891600948991265]
We propose an Injected Noise Discriminator (INoD) which exploits principles of feature replacement and dataset discrimination for self-supervised representation learning.
INoD interleaves feature maps from two disjoint datasets during their convolutional encoding and predicts the dataset affiliation of the resultant feature map as a pretext task.
Our approach enables the network to learn unequivocal representations of objects seen in one dataset while observing them in conjunction with similar features from the disjoint dataset.
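A minimal sketch of this kind of dataset-discrimination pretext task, under a simplified setup: intermediate feature maps from two batches drawn from disjoint datasets are mixed by randomly swapping spatial positions, and a 1x1 convolutional head predicts the dataset affiliation of every position. The mixing scheme, module names, and loss are illustrative assumptions, not the INoD implementation.

```python
import torch
import torch.nn as nn

class DatasetAffiliationHead(nn.Module):
    """Predicts, per spatial location, which dataset a feature vector came from."""
    def __init__(self, channels: int):
        super().__init__()
        self.classifier = nn.Conv2d(channels, 1, kernel_size=1)  # logit: dataset A vs. B

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.classifier(feats)  # (B, 1, H, W) logits

def interleave(feats_a: torch.Tensor, feats_b: torch.Tensor, p: float = 0.3):
    """Replace a random fraction p of spatial positions in feats_a with the
    corresponding positions from feats_b; also return the per-position target
    (1.0 where the feature came from dataset B)."""
    b, _, h, w = feats_a.shape
    swap = (torch.rand(b, 1, h, w, device=feats_a.device) < p).float()
    mixed = feats_a * (1 - swap) + feats_b * swap
    return mixed, swap

# Toy usage: encoder features for batches from two disjoint field datasets.
feats_a = torch.randn(2, 64, 32, 32)
feats_b = torch.randn(2, 64, 32, 32)
mixed, target = interleave(feats_a, feats_b)
head = DatasetAffiliationHead(64)
loss = nn.functional.binary_cross_entropy_with_logits(head(mixed), target)
```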
arXiv Detail & Related papers (2023-03-31T14:46:31Z)
- Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation for autonomous vehicles [63.20765930558542]
3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization.
We propose a new dataset, Navya 3D Segmentation (Navya3DSeg), with a diverse label space corresponding to a large-scale, production-grade operational domain.
It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds.
arXiv Detail & Related papers (2023-02-16T13:41:19Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
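As a rough sketch of this pretext task, assuming patch embeddings produced by some backbone: a small head receives a partially masked reference patch embedding together with a query patch embedding and regresses the query's grid offset relative to the reference. The grid size, masking ratio, and regression loss are assumptions for illustration; the paper's actual formulation may differ (for example, it may cast location prediction as classification).

```python
import torch
import torch.nn as nn

EMBED, GRID = 128, 7  # assumed embedding width and patch-grid side length

class RelativeLocationHead(nn.Module):
    """Predicts the (row, col) offset of a query patch relative to a reference patch."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * EMBED, 256), nn.ReLU(),
            nn.Linear(256, 2),  # predicted (row_offset, col_offset)
        )

    def forward(self, ref: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([ref, query], dim=-1))

# Toy batch: random embeddings standing in for a backbone's patch features.
patches = torch.randn(4, GRID * GRID, EMBED)            # (batch, patches, dim)
ref_idx = torch.randint(0, GRID * GRID, (4,))
qry_idx = torch.randint(0, GRID * GRID, (4,))
ref = patches[torch.arange(4), ref_idx]
qry = patches[torch.arange(4), qry_idx]
ref = ref * (torch.rand_like(ref) > 0.5).float()        # mask half the reference features
target = torch.stack([(qry_idx // GRID) - (ref_idx // GRID),
                      (qry_idx % GRID) - (ref_idx % GRID)], dim=-1).float()
loss = nn.functional.mse_loss(RelativeLocationHead()(ref, qry), target)
```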
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Conditioning Covert Geo-Location (CGL) Detection on Semantic Class Information [5.660207256468971]
The task of identifying potential hideouts, termed Covert Geo-Location (CGL) detection, was proposed by Saha et al.
However, no attempts were made to utilize semantic class information, which is crucial for detecting such obscured locations.
In this paper, we propose a multitask-learning-based approach with two goals: i) extraction of features carrying semantic class information; ii) robust training of the common encoder, exploiting large standard annotated datasets as the training set for the auxiliary task (semantic segmentation).
arXiv Detail & Related papers (2022-11-27T07:21:59Z)
- PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601]
We propose a Prior-Guided Local (PGL) self-supervised model that learns region-wise local consistency in the latent feature space.
Our PGL model learns distinctive representations of local regions and is hence able to retain structural information.
arXiv Detail & Related papers (2020-11-25T11:03:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.