Chunk-aware Alignment and Lexical Constraint for Visual Entailment with
Natural Language Explanations
- URL: http://arxiv.org/abs/2207.11401v1
- Date: Sat, 23 Jul 2022 03:19:50 GMT
- Title: Chunk-aware Alignment and Lexical Constraint for Visual Entailment with
Natural Language Explanations
- Authors: Qian Yang and Yunxin Li and Baotian Hu and Lin Ma and Yuxing Ding and
Min Zhang
- Abstract summary: Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and generate a sentence to explain the decision-making process.
Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation.
We propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC.
- Score: 38.50987889221086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual Entailment with natural language explanations aims to infer the
relationship between a text-image pair and generate a sentence to explain the
decision-making process. Previous methods rely mainly on a pre-trained
vision-language model to perform the relation inference and a language model to
generate the corresponding explanation. However, the pre-trained
vision-language models mainly build token-level alignment between text and
image yet ignore the high-level semantic alignment between the phrases (chunks)
and visual contents, which is critical for vision-language reasoning. Moreover,
the explanation generator based only on the encoded joint representation does
not explicitly consider the critical decision-making points of relation
inference. Thus, the generated explanations are less faithful to vision-language
reasoning. To mitigate these problems, we propose a unified Chunk-aware
Alignment and Lexical Constraint based method, dubbed CALeC. It contains a
Chunk-aware Semantic Interactor (abbr. CSI), a relation inferrer, and a Lexical
Constraint-aware Generator (abbr. LeCG). Specifically, CSI exploits the sentence
structure inherent in language and various image regions to build chunk-aware
semantic alignment. The relation inferrer uses an attention-based reasoning
network to incorporate the token-level and chunk-level vision-language
representations. LeCG utilizes lexical constraints to explicitly incorporate the
words or chunks attended to by the relation inferrer into explanation generation,
improving the faithfulness and informativeness of the explanations. We conduct
extensive experiments on three datasets, and the results indicate that CALeC
significantly outperforms competing models in both inference accuracy and the
quality of the generated explanations.
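To make the two key ideas above concrete, the sketch below illustrates (a) cross-attention that aligns phrase (chunk) embeddings with image-region features and (b) picking the most strongly attended chunks as lexical constraints for explanation generation. This is a minimal, hypothetical illustration only, not the authors' CALeC implementation; the class and function names, dimensions, and the constraint-selection heuristic are assumptions made for exposition.

```python
# Illustrative sketch only -- NOT the released CALeC code. Shapes, names, and the
# constraint-selection heuristic are assumptions made for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChunkRegionAttention(nn.Module):
    """Cross-attention from text chunks (phrases) to image regions."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # project chunk embeddings to queries
        self.k = nn.Linear(dim, dim)  # project region features to keys
        self.v = nn.Linear(dim, dim)  # project region features to values

    def forward(self, chunk_emb: torch.Tensor, region_emb: torch.Tensor):
        # chunk_emb:  (num_chunks, dim)  -- one vector per phrase/chunk
        # region_emb: (num_regions, dim) -- one vector per detected image region
        q, k, v = self.q(chunk_emb), self.k(region_emb), self.v(region_emb)
        attn = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)  # (chunks, regions)
        aligned = attn @ v  # region-grounded chunk representations
        return aligned, attn


def select_lexical_constraints(chunks, chunk_scores, top_k=2):
    """Pick the chunks attended to most strongly.

    The returned strings would then be handed to the explanation generator
    as lexical constraints (e.g. enforced with constrained beam search).
    """
    order = torch.argsort(chunk_scores, descending=True)
    return [chunks[i] for i in order[:top_k].tolist()]


if __name__ == "__main__":
    torch.manual_seed(0)
    chunks = ["a man in a red shirt", "riding a horse", "on the beach"]
    chunk_emb = torch.randn(len(chunks), 768)   # stand-in chunk encodings
    region_emb = torch.randn(36, 768)           # stand-in object-detector region features

    aligner = ChunkRegionAttention()
    aligned, attn = aligner(chunk_emb, region_emb)

    # Stand-in for the relation inferrer's per-chunk importance: here we simply
    # take the maximum attention weight a chunk places on any region.
    chunk_scores = attn.max(dim=-1).values
    print(select_lexical_constraints(chunks, chunk_scores, top_k=2))
```

In a full pipeline, the selected chunk strings would be passed to a decoder that supports constrained decoding, so that the explanation is forced to mention the evidence the relation inferrer relied on; the scoring heuristic shown here is only a placeholder for that attention signal.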
Related papers
- Natural Language Inference Improves Compositionality in Vision-Language Models [35.71815423077561]
We present a principled approach that generates entailments and contradictions from a given premise.
CECE produces lexically diverse sentences while maintaining their core meaning.
We achieve significant improvements over previous methods without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-10-29T17:54:17Z)
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on the ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- 3VL: using Trees to teach Vision & Language models compositional concepts [45.718319397947056]
We introduce the Tree-augmented Vision-Language (3VL) model architecture and training technique.
We show how Anchor, a simple technique for text unification, can be employed to filter nuisance factors.
We also exhibit how DiRe, which performs a differential relevancy comparison between VLM maps, enables us to generate compelling visualizations of a model's success or failure.
arXiv Detail & Related papers (2023-12-28T20:26:03Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and greatly boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation [12.977857322594206]
One-stage scene graph generation approaches infer the effective relation between entity pairs using sparse proposal sets and a few queries.
A Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model.
Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced.
arXiv Detail & Related papers (2022-12-19T09:47:27Z)
- Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection [62.071938098215085]
We focus on the CommonGen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text.
arXiv Detail & Related papers (2020-12-19T23:23:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.