Natural Language Rationales with Full-Stack Visual Reasoning: From
Pixels to Semantic Frames to Commonsense Graphs
- URL: http://arxiv.org/abs/2010.07526v1
- Date: Thu, 15 Oct 2020 05:08:56 GMT
- Title: Natural Language Rationales with Full-Stack Visual Reasoning: From
Pixels to Semantic Frames to Commonsense Graphs
- Authors: Ana Marasović, Chandra Bhagavatula, Jae Sung Park, Ronan Le Bras,
Noah A. Smith, Yejin Choi
- Abstract summary: We present the first study focused on generating natural language rationales across several complex visual reasoning tasks.
We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs.
Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
- Score: 106.15931418425906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language rationales could provide intuitive, higher-level
explanations that are easily understandable by humans, complementing the more
broadly studied lower-level explanations based on gradients or attention
weights. We present the first study focused on generating natural language
rationales across several complex visual reasoning tasks: visual commonsense
reasoning, visual-textual entailment, and visual question answering. The key
challenge of accurate rationalization is comprehensive image understanding at
all levels: not just explicit content at the pixel level, but also contextual
content at the semantic and pragmatic levels. We present
Rationale^VT Transformer, an integrated model that learns to generate free-text
rationales by combining pretrained language models with object recognition,
grounded visual semantic frames, and visual commonsense graphs. Our experiments
show that the base pretrained language model benefits from visual adaptation
and that free-text rationalization is a promising research direction to
complement model interpretability for complex visual-textual reasoning tasks.
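The fusion described in the abstract can be made concrete with a small sketch. The following is a minimal, illustrative PyTorch implementation of visual-prefix conditioning, not the authors' released model: features from object recognition, grounded semantic frames, and the visual commonsense graph are each projected into the word-embedding space of a decoder-style language model and prepended to the text tokens before rationale generation. All module names, dimensions, and the plain Transformer stand-in for the pretrained language model are assumptions for illustration only.

```python
# Minimal illustrative sketch (not the authors' released code) of the fusion the
# abstract describes: visual signals from object recognition, grounded semantic
# frames, and a visual commonsense graph are projected into a decoder's
# embedding space and prepended as a prefix before generating a free-text
# rationale. Dimensions and module names are assumptions for illustration.
import torch
import torch.nn as nn


class RationaleVTSketch(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, d_visual=2048, n_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)      # stand-in for LM embeddings
        block = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, n_layers)   # causal masking omitted for brevity
        self.lm_head = nn.Linear(d_model, vocab_size)
        # one projection per level of the visual "full stack"
        self.proj_objects = nn.Linear(d_visual, d_model)  # pixel-level object features
        self.proj_frames = nn.Linear(d_visual, d_model)   # grounded visual semantic frames
        self.proj_graph = nn.Linear(d_visual, d_model)    # visual commonsense graph nodes

    def forward(self, object_feats, frame_feats, graph_feats, token_ids):
        # Project each visual source into the word-embedding space and prepend it.
        prefix = torch.cat([self.proj_objects(object_feats),
                            self.proj_frames(frame_feats),
                            self.proj_graph(graph_feats)], dim=1)  # (B, n_prefix, d_model)
        text = self.token_emb(token_ids)                           # (B, n_tokens, d_model)
        hidden = self.decoder(torch.cat([prefix, text], dim=1))
        return self.lm_head(hidden)                                # next-token logits


# Toy usage: 3 detected objects, 2 frame slots, 4 graph nodes, 8 text tokens.
model = RationaleVTSketch()
logits = model(torch.randn(1, 3, 2048), torch.randn(1, 2, 2048),
               torch.randn(1, 4, 2048), torch.randint(0, 50257, (1, 8)))
print(logits.shape)  # torch.Size([1, 17, 50257])
```

In the paper itself the backbone is a pretrained generative language model and the fusion details differ; the sketch only shows the shape of the interface, in which each level of visual understanding enters as additional conditioning tokens.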
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- What Makes a Maze Look Like a Maze? [92.80800000328277]
We introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning.
At the core of DSG are schemas: dependency-graph descriptions of abstract concepts that decompose them into more primitive-level symbols.
We show that DSG significantly improves the abstract visual reasoning performance of vision-language models.
arXiv Detail & Related papers (2024-09-12T16:41:47Z)
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text [23.854023255928208]
We propose an end-to-end Neural Divide-and-Conquer Reasoning framework, dubbed NDCR.
It contains three main components: 1) Divide: a proposition generator splits the compound proposition text into simple proposition sentences and produces their representations; 2) Conquer: a pretrained visual-linguistic interactor models the interaction between proposition sentences and images; and 3) Combine: a neural-symbolic reasoner merges the resulting reasoning states to obtain the final solution (see the sketch after this entry).
arXiv Detail & Related papers (2023-05-03T16:55:00Z)
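To make the three-stage pipeline above concrete, here is a schematic sketch under stated assumptions; the function names, the naive proposition splitter, and the stub scorer are illustrative placeholders, not the NDCR authors' implementation.

```python
# Schematic sketch of a divide-and-conquer retrieval pipeline in the spirit of
# NDCR; all names and the naive splitting/scoring logic are illustrative
# assumptions, not the paper's code.
from dataclasses import dataclass
from typing import List


@dataclass
class PropositionState:
    text: str     # one simple proposition split out of the compound query
    score: float  # image-proposition compatibility


def cross_modal_score(proposition: str, image) -> float:
    # Stub standing in for a pretrained visual-linguistic interactor
    # (e.g. an image-text matching head); returns a dummy score here.
    return 0.5


def divide(compound_query: str) -> List[str]:
    # Placeholder proposition generator: naive split on "and".
    return [p.strip() for p in compound_query.split(" and ") if p.strip()]


def conquer(propositions: List[str], image) -> List[PropositionState]:
    # Score each simple proposition against the candidate image.
    return [PropositionState(p, cross_modal_score(p, image)) for p in propositions]


def combine(states: List[PropositionState]) -> float:
    # Stand-in for the neural-symbolic reasoner: a hard conjunction (minimum
    # score), whereas the paper learns how to merge the reasoning states.
    return min(s.score for s in states)


# Toy usage: rank a candidate image for a linguistically complex query.
query = "a man is holding a red umbrella and a dog is sitting on the bench"
final_score = combine(conquer(divide(query), image=None))
print(final_score)  # 0.5
```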
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure of exploring the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Accessible Visualization via Natural Language Descriptions: A Four-Level Model of Semantic Content [6.434361163743876]
We introduce a conceptual model for the semantic content conveyed by natural language descriptions of visualizations.
We conduct a mixed-methods evaluation with 30 blind and 90 sighted readers, and find that these reader groups differ significantly on which semantic content they rank as most useful.
arXiv Detail & Related papers (2021-10-08T23:37:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.