Natural Language Rationales with Full-Stack Visual Reasoning: From
Pixels to Semantic Frames to Commonsense Graphs
- URL: http://arxiv.org/abs/2010.07526v1
- Date: Thu, 15 Oct 2020 05:08:56 GMT
- Title: Natural Language Rationales with Full-Stack Visual Reasoning: From
Pixels to Semantic Frames to Commonsense Graphs
- Authors: Ana Marasović, Chandra Bhagavatula, Jae Sung Park, Ronan Le Bras,
Noah A. Smith, Yejin Choi
- Abstract summary: We present the first study focused on generating natural language rationales across several complex visual reasoning tasks.
We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs.
Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
- Score: 106.15931418425906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language rationales could provide intuitive, higher-level
explanations that are easily understandable by humans, complementing the more
broadly studied lower-level explanations based on gradients or attention
weights. We present the first study focused on generating natural language
rationales across several complex visual reasoning tasks: visual commonsense
reasoning, visual-textual entailment, and visual question answering. The key
challenge of accurate rationalization is comprehensive image understanding at
all levels: not just their explicit content at the pixel level, but their
contextual contents at the semantic and pragmatic levels. We present
Rationale^VT Transformer, an integrated model that learns to generate free-text
rationales by combining pretrained language models with object recognition,
grounded visual semantic frames, and visual commonsense graphs. Our experiments
show that the base pretrained language model benefits from visual adaptation
and that free-text rationalization is a promising research direction to
complement model interpretability for complex visual-textual reasoning tasks.
Related papers
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS), for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- Semantic Composition in Visually Grounded Language Models [0.0]
We show that visually-grounded language models drastically fail to represent compositional structure.
We introduce WinogroundVQA, a new compositional visual question answering benchmark.
We discuss connections of our work to neuroscience, psycholinguistics, formal semantics, and philosophy.
arXiv Detail & Related papers (2023-05-15T03:19:42Z)
- A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text [23.854023255928208]
We propose an end-to-end Neural Divide-and-Conquer Reasoning framework, dubbed NDCR.
It contains three main components: 1) Divide: a proposition generator divides the compound proposition text into simple proposition sentences and produces their corresponding representations, 2) Conquer: a pretrained visual-linguistic interactor achieves the interaction between proposition sentences and images, and 3) Combine: a neural-symbolic reasoner combines the above reasoning states to obtain the final solution.
arXiv Detail & Related papers (2023-05-03T16:55:00Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure of exploring the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Accessible Visualization via Natural Language Descriptions: A Four-Level Model of Semantic Content [6.434361163743876]
We introduce a conceptual model for the semantic content conveyed by natural language descriptions of visualizations.
We conduct a mixed-methods evaluation with 30 blind and 90 sighted readers, and find that these reader groups differ significantly on which semantic content they rank as most useful.
arXiv Detail & Related papers (2021-10-08T23:37:25Z)