VisualCOMET: Reasoning about the Dynamic Context of a Still Image
- URL: http://arxiv.org/abs/2004.10796v3
- Date: Sat, 1 Aug 2020 13:11:10 GMT
- Title: VisualCOMET: Reasoning about the Dynamic Context of a Still Image
- Authors: Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi,
Yejin Choi
- Abstract summary: We propose VisualCOMET, a framework for visual commonsense reasoning.
VisualCOMET predicts events that might have happened before, events that might happen next, and the intents of the people at present.
We introduce the first large-scale repository of Visual Commonsense Graphs.
- Score: 97.20800299330078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Even from a single frame of a still image, people can reason about the
dynamic story of the image before, after, and beyond the frame. For example,
given an image of a man struggling to stay afloat in water, we can reason that
the man fell into the water sometime in the past, the intent of that man at the
moment is to stay alive, and he will need help in the near future or else he
will get washed away. We propose VisualCOMET, a novel framework of visual
commonsense reasoning tasks for predicting events that might have happened before,
events that might happen next, and the intents of the people at present. To
support research toward visual commonsense reasoning, we introduce the first
large-scale repository of Visual Commonsense Graphs that consists of over 1.4
million textual descriptions of visual commonsense inferences carefully
annotated over a diverse set of 60,000 images, each paired with short video
summaries of before and after. In addition, we provide person-grounding (i.e.,
co-reference links) between people appearing in the image and people mentioned
in the textual commonsense descriptions, allowing for tighter integration
between images and text. We establish strong baseline performances on this task
and demonstrate that integrating visual and textual commonsense reasoning is
key, outperforming non-integrative alternatives.
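To make the structure of the dataset concrete, here is a minimal sketch of what one Visual Commonsense Graph annotation might look like; the class, field names, and example values are illustrative assumptions, not the repository's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VisualCommonsenseRecord:
    """One hypothetical Visual Commonsense Graph annotation (illustrative schema)."""
    image_id: str                 # one of the ~60,000 annotated images
    event: str                    # what is happening at present
    place: str                    # where the event takes place
    before: List[str] = field(default_factory=list)  # events that might have happened before
    after: List[str] = field(default_factory=list)   # events that might happen next
    intent: List[str] = field(default_factory=list)  # what the person wants at present

# Numeric tags like "1" stand in for the person-grounding links the abstract
# describes: each tag co-refers to a detected person region in the image.
record = VisualCommonsenseRecord(
    image_id="frame_0001",
    event="1 is struggling to stay afloat",
    place="in open water",
    before=["1 fell into the water"],
    after=["1 calls out for help"],
    intent=["stay alive"],
)
```

The person tags are what makes the integration between images and text "tight": every person mention in the inference text resolves to pixels in the image.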
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling [12.560014305032437]
This paper introduces SCO-VIST, a framework representing the image sequence as a graph with objects and relations.
SCO-VIST then takes this graph representing plot points and creates bridges between plot points with semantic and occurrence-based edge weights.
This weighted story graph produces the storyline as a sequence of events using the Floyd-Warshall algorithm.
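The mechanism named here is standard enough to sketch: run all-pairs shortest paths over the weighted plot-point graph and read off the lowest-cost chain of events as the storyline. The plot points, weights, and function below are illustrative stand-ins, not SCO-VIST's actual pipeline.

```python
import math

def storyline(nodes, weights, start, goal):
    """Minimal Floyd-Warshall sketch: given plot-point nodes and edge costs
    (lower = stronger semantic/occurrence bridge), recover the lowest-cost
    chain of events from `start` to `goal`. `weights` maps (u, v) to a cost."""
    dist = {(u, v): 0.0 if u == v else weights.get((u, v), math.inf)
            for u in nodes for v in nodes}
    nxt = {(u, v): v for u in nodes for v in nodes
           if u != v and (u, v) in weights}
    for k in nodes:                          # standard O(n^3) relaxation
        for i in nodes:
            for j in nodes:
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
                    nxt[i, j] = nxt[i, k]
    if (start, goal) not in nxt:
        return None                          # no connecting story path
    path, u = [start], start
    while u != goal:                         # follow successor pointers
        u = nxt[u, goal]
        path.append(u)
    return path

events = ["wake up", "pack bags", "drive to airport", "board plane"]
w = {("wake up", "pack bags"): 1.0, ("pack bags", "drive to airport"): 1.0,
     ("wake up", "drive to airport"): 3.5, ("drive to airport", "board plane"): 1.0}
print(storyline(events, w, "wake up", "board plane"))
# ['wake up', 'pack bags', 'drive to airport', 'board plane']
```

Floyd-Warshall is a natural fit here because every pair of plot points may need a bridge cost, and it yields all-pairs paths in a single pass.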
arXiv Detail & Related papers (2024-02-01T04:09:17Z)
- Contextually-rich human affect perception using multimodal scene information [36.042369831043686]
We leverage pretrained vision-language models to extract descriptions of the foreground context from images.
We propose a multimodal context fusion (MCF) module to combine foreground cues with the visual scene and person-based contextual information for emotion prediction.
We show the effectiveness of our proposed modular design on two datasets associated with natural scenes and TV shows.
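As a rough illustration of what such a fusion module could reduce to, the sketch below concatenates the three context streams and maps them to emotion logits; all dimensions, names, and the 26-way output are illustrative assumptions rather than the paper's actual MCF design.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Hedged sketch of multimodal context fusion: concatenate the three
    context streams and map them to emotion logits. The paper's MCF module
    is more elaborate; dimensions here are illustrative."""
    def __init__(self, d_caption=512, d_scene=512, d_person=512,
                 d_hidden=256, n_emotions=26):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_caption + d_scene + d_person, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, n_emotions),
        )

    def forward(self, caption_feat, scene_feat, person_feat):
        # caption_feat: embedding of the generated foreground description
        # scene_feat:   holistic visual scene feature
        # person_feat:  features cropped around the target person
        x = torch.cat([caption_feat, scene_feat, person_feat], dim=-1)
        return self.fuse(x)

model = ContextFusion()
logits = model(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 26])
```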
arXiv Detail & Related papers (2023-03-13T07:46:41Z)
- Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding [87.39245901710079]
We present a new commonsense task, Human-centric Commonsense Grounding.
It tests a model's ability to ground individuals in an image given context descriptions of what happened before.
We set up a context-object-aware method as a strong baseline that outperforms previous pretrained and non-pretrained models.
arXiv Detail & Related papers (2022-12-14T01:37:16Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Topic Scene Graph Generation by Attention Distillation from Caption [1.181694273002388]
A scene graph is not as practical as expected unless it can filter out trivial content and noise.
We let the scene graph borrow this focusing ability from image captioning, so that it becomes specialized while remaining comprehensive.
Experiments show that attention distillation brings significant improvements in mining important relationships without strong supervision.
arXiv Detail & Related papers (2021-10-12T04:26:12Z)
- Enhancing Social Relation Inference with Concise Interaction Graph and Discriminative Scene Representation [56.25878966006678]
We propose an approach named PRactical Inference in Social rElation (PRISE).
It concisely learns interactive features of persons and discriminative features of holistic scenes.
PRISE achieves a 6.8% improvement for domain classification on the PIPA dataset.
arXiv Detail & Related papers (2021-07-30T04:20:13Z)
- Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation [98.34909905511061]
We argue that a desirable scene graph should be hierarchically constructed, and introduce a new scheme that models the scene graph as a Hierarchical Entity Tree (HET).
To generate a scene graph based on HET, we parse HET with a Hybrid Long Short-Term Memory (Hybrid-LSTM) which specifically encodes hierarchy and sibling context.
To further prioritize key relations in the scene graph, we devise a Relation Ranking Module (RRM) to dynamically adjust their rankings.
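A minimal sketch of the score-and-sort skeleton behind such a ranking module, assuming relation features already come out of an encoder like the Hybrid-LSTM; the module name, dimensions, and scorer are illustrative assumptions, not the paper's RRM.

```python
import torch
import torch.nn as nn

class RelationRanker(nn.Module):
    """Illustrative stand-in for a relation ranking module: score each
    candidate relation from its feature vector, then sort. The paper's RRM
    adjusts rankings dynamically from context; this only shows the skeleton."""
    def __init__(self, d_rel=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(d_rel, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, rel_feats):
        # rel_feats: (num_relations, d_rel) features, e.g. from an encoder
        scores = self.scorer(rel_feats).squeeze(-1)     # (num_relations,)
        order = torch.argsort(scores, descending=True)  # key relations first
        return scores, order

ranker = RelationRanker()
scores, order = ranker(torch.randn(5, 256))
print(order)  # indices of relations, most important first
```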
arXiv Detail & Related papers (2020-07-17T05:12:13Z)
- Visual Relationship Detection using Scene Graphs: A Survey [1.3505077405741583]
A scene graph is a technique for better representing a scene and the various relationships present in it.
We present a detailed survey of techniques for scene graph generation, their efficacy in representing visual relationships, and how they have been used to solve various downstream tasks.
arXiv Detail & Related papers (2020-05-16T17:06:06Z)