Natural Language Rationales with Full-Stack Visual Reasoning: From
Pixels to Semantic Frames to Commonsense Graphs
- URL: http://arxiv.org/abs/2010.07526v1
- Date: Thu, 15 Oct 2020 05:08:56 GMT
- Title: Natural Language Rationales with Full-Stack Visual Reasoning: From
Pixels to Semantic Frames to Commonsense Graphs
- Authors: Ana Marasović, Chandra Bhagavatula, Jae Sung Park, Ronan Le Bras,
Noah A. Smith, Yejin Choi
- Abstract summary: We present the first study focused on generating natural language rationales across several complex visual reasoning tasks.
We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs.
Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
- Score: 106.15931418425906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language rationales could provide intuitive, higher-level
explanations that are easily understandable by humans, complementing the more
broadly studied lower-level explanations based on gradients or attention
weights. We present the first study focused on generating natural language
rationales across several complex visual reasoning tasks: visual commonsense
reasoning, visual-textual entailment, and visual question answering. The key
challenge of accurate rationalization is comprehensive image understanding at
all levels: not just explicit content at the pixel level, but also contextual
content at the semantic and pragmatic levels. We present
Rationale^VT Transformer, an integrated model that learns to generate free-text
rationales by combining pretrained language models with object recognition,
grounded visual semantic frames, and visual commonsense graphs. Our experiments
show that the base pretrained language model benefits from visual adaptation
and that free-text rationalization is a promising research direction to
complement model interpretability for complex visual-textual reasoning tasks.
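The fusion described in the abstract can be made concrete with a small sketch. The following is a minimal, illustrative PyTorch implementation of visual-prefix conditioning, not the authors' released model: features from object recognition, grounded semantic frames, and the visual commonsense graph are each projected into the word-embedding space of a decoder-style language model and prepended to the text tokens before rationale generation. All module names, dimensions, and the plain Transformer stand-in for the pretrained language model are assumptions for illustration only.

```python
# Minimal illustrative sketch (not the authors' released code) of the fusion the
# abstract describes: visual signals from object recognition, grounded semantic
# frames, and a visual commonsense graph are projected into a decoder's
# embedding space and prepended as a prefix before generating a free-text
# rationale. Dimensions and module names are assumptions for illustration.
import torch
import torch.nn as nn


class RationaleVTSketch(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, d_visual=2048, n_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)      # stand-in for LM embeddings
        block = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, n_layers)   # causal masking omitted for brevity
        self.lm_head = nn.Linear(d_model, vocab_size)
        # one projection per level of the visual "full stack"
        self.proj_objects = nn.Linear(d_visual, d_model)  # pixel-level object features
        self.proj_frames = nn.Linear(d_visual, d_model)   # grounded visual semantic frames
        self.proj_graph = nn.Linear(d_visual, d_model)    # visual commonsense graph nodes

    def forward(self, object_feats, frame_feats, graph_feats, token_ids):
        # Project each visual source into the word-embedding space and prepend it.
        prefix = torch.cat([self.proj_objects(object_feats),
                            self.proj_frames(frame_feats),
                            self.proj_graph(graph_feats)], dim=1)  # (B, n_prefix, d_model)
        text = self.token_emb(token_ids)                           # (B, n_tokens, d_model)
        hidden = self.decoder(torch.cat([prefix, text], dim=1))
        return self.lm_head(hidden)                                # next-token logits


# Toy usage: 3 detected objects, 2 frame slots, 4 graph nodes, 8 text tokens.
model = RationaleVTSketch()
logits = model(torch.randn(1, 3, 2048), torch.randn(1, 2, 2048),
               torch.randn(1, 4, 2048), torch.randint(0, 50257, (1, 8)))
print(logits.shape)  # torch.Size([1, 17, 50257])
```

In the paper itself the backbone is a pretrained generative language model and the fusion details differ; the sketch only shows the shape of the interface, in which each level of visual understanding enters as additional conditioning tokens.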
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- What Makes a Maze Look Like a Maze? [92.80800000328277]
We introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning.
At the core of DSG are schemas: dependency-graph descriptions of abstract concepts that decompose them into more primitive-level symbols.
We show that DSG significantly improves the abstract visual reasoning performance of vision-language models.
arXiv Detail & Related papers (2024-09-12T16:41:47Z)
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text [23.854023255928208]
We propose an end-to-end Neural Divide-and-Conquer Reasoning framework, dubbed NDCR.
It contains three main components: 1) Divide: a proposition generator splits the compound proposition text into simple proposition sentences and produces their representations; 2) Conquer: a pretrained visual-linguistic interactor models the interaction between proposition sentences and images; and 3) Combine: a neural-symbolic reasoner merges the resulting reasoning states to obtain the final solution (see the sketch after this entry).
arXiv Detail & Related papers (2023-05-03T16:55:00Z)
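To make the three-stage pipeline above concrete, here is a schematic sketch under stated assumptions; the function names, the naive proposition splitter, and the stub scorer are illustrative placeholders, not the NDCR authors' implementation.

```python
# Schematic sketch of a divide-and-conquer retrieval pipeline in the spirit of
# NDCR; all names and the naive splitting/scoring logic are illustrative
# assumptions, not the paper's code.
from dataclasses import dataclass
from typing import List


@dataclass
class PropositionState:
    text: str     # one simple proposition split out of the compound query
    score: float  # image-proposition compatibility


def cross_modal_score(proposition: str, image) -> float:
    # Stub standing in for a pretrained visual-linguistic interactor
    # (e.g. an image-text matching head); returns a dummy score here.
    return 0.5


def divide(compound_query: str) -> List[str]:
    # Placeholder proposition generator: naive split on "and".
    return [p.strip() for p in compound_query.split(" and ") if p.strip()]


def conquer(propositions: List[str], image) -> List[PropositionState]:
    # Score each simple proposition against the candidate image.
    return [PropositionState(p, cross_modal_score(p, image)) for p in propositions]


def combine(states: List[PropositionState]) -> float:
    # Stand-in for the neural-symbolic reasoner: a hard conjunction (minimum
    # score), whereas the paper learns how to merge the reasoning states.
    return min(s.score for s in states)


# Toy usage: rank a candidate image for a linguistically complex query.
query = "a man is holding a red umbrella and a dog is sitting on the bench"
final_score = combine(conquer(divide(query), image=None))
print(final_score)  # 0.5
```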
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure of exploring the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Accessible Visualization via Natural Language Descriptions: A Four-Level Model of Semantic Content [6.434361163743876]
We introduce a conceptual model for the semantic content conveyed by natural language descriptions of visualizations.
We conduct a mixed-methods evaluation with 30 blind and 90 sighted readers, and find that these reader groups differ significantly on which semantic content they rank as most useful.
arXiv Detail & Related papers (2021-10-08T23:37:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.