A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from
Linguistically Complex Text
- URL: http://arxiv.org/abs/2305.02265v2
- Date: Fri, 5 May 2023 17:41:57 GMT
- Title: A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from
Linguistically Complex Text
- Authors: Yunxin Li, Baotian Hu, Yuxin Ding, Lin Ma, and Min Zhang
- Abstract summary: We propose an end-to-end Neural Divide-and-Conquer Reasoning framework, dubbed NDCR.
It contains three main components: 1) Divide: a proposition generator divides the compound proposition text into simple proposition sentences and produces their corresponding representations, 2) Conquer: a pretrained visual-linguistic interactor models the interaction between proposition sentences and images, and 3) Combine: a neural-symbolic reasoner combines the above reasoning states to obtain the final solution.
- Score: 23.854023255928208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained Vision-Language Models (VLMs) have achieved remarkable performance
in image retrieval from text. However, their performance drops drastically when
confronted with linguistically complex texts that they struggle to comprehend.
Inspired by the Divide-and-Conquer algorithm and dual-process theory, in this
paper, we regard linguistically complex texts as compound proposition texts
composed of multiple simple proposition sentences and propose an end-to-end
Neural Divide-and-Conquer Reasoning framework, dubbed NDCR. It contains three
main components: 1) Divide: a proposition generator divides the compound
proposition text into simple proposition sentences and produces their
corresponding representations, 2) Conquer: a pretrained VLM-based
visual-linguistic interactor models the interaction between the decomposed
proposition sentences and images, and 3) Combine: a neural-symbolic reasoner
combines the above reasoning states to obtain the final solution via a neural
logic reasoning approach. Under dual-process theory, the visual-linguistic
interactor and the neural-symbolic reasoner can be regarded as the
analogical-reasoning System 1 and the logical-reasoning System 2,
respectively. We conduct extensive experiments on a challenging dataset for
image retrieval from contextual descriptions. Experimental results and
analyses indicate that NDCR significantly improves performance on this
complex image-text reasoning problem.
Code link: https://github.com/YunxinLi/NDCR.
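To make the Divide-Conquer-Combine flow above concrete, here is a minimal illustrative sketch in Python. The clause-splitting heuristic, the constant VLM score, and the product-style conjunction are all hypothetical stand-ins for the learned components of the actual framework (see the code link above).

```python
# Minimal sketch of a divide-conquer-combine retrieval flow; every component
# below is a placeholder, not the paper's implementation.
from typing import List


def divide(compound_text: str) -> List[str]:
    """Divide: split a compound proposition text into simple propositions.
    Approximated here by clause-level splitting; NDCR uses a learned generator."""
    seps = [" and ", ", while ", "; "]
    parts = [compound_text]
    for sep in seps:
        parts = [p for chunk in parts for p in chunk.split(sep)]
    return [p.strip() for p in parts if p.strip()]


def conquer(proposition: str, image_id: str) -> float:
    """Conquer: a pretrained VLM would score how well `image_id` satisfies
    `proposition`. Stubbed with a constant for illustration."""
    return 0.5  # placeholder for a CLIP/BLIP-style image-text score in [0, 1]


def combine(scores: List[float]) -> float:
    """Combine: conjunction over proposition-level states, approximated here
    by a soft logical AND (product of the per-proposition scores)."""
    result = 1.0
    for s in scores:
        result *= s
    return result


def rank_images(compound_text: str, image_ids: List[str]) -> List[str]:
    """Rank candidate images by how well they satisfy all propositions."""
    propositions = divide(compound_text)
    scored = [
        (combine([conquer(p, img) for p in propositions]), img)
        for img in image_ids
    ]
    return [img for _, img in sorted(scored, reverse=True)]
```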
Related papers
- LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning [73.98142349171552]
LOGICSEG is a holistic visual semantic parsing framework that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge.
During fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training.
These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models.
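As a rough illustration of how a fuzzy-logic relaxation can ground a logical formula onto differentiable quantities, the snippet below encodes a made-up hierarchy rule with a product t-norm; LOGICSEG's actual formulae and relaxation may differ.

```python
# Hedged illustration: logical connectives become differentiable operations
# over predicted probabilities, so rule violations can be penalised during
# network training. The rule ("a pixel labelled 'cat' implies it is labelled
# 'animal'") is an invented example, not one from the LOGICSEG paper.

def fuzzy_not(p: float) -> float:
    return 1.0 - p

def fuzzy_and(p: float, q: float) -> float:
    return p * q  # product t-norm

def fuzzy_implies(p: float, q: float) -> float:
    # p -> q rewritten as not(p and not q)
    return fuzzy_not(fuzzy_and(p, fuzzy_not(q)))

# Truth value of the hierarchy rule for one pixel's predicted probabilities.
p_cat, p_animal = 0.9, 0.7
rule_satisfaction = fuzzy_implies(p_cat, p_animal)
logic_loss = 1.0 - rule_satisfaction  # would be added to the segmentation loss
```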
arXiv Detail & Related papers (2023-09-24T05:43:19Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
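A hedged sketch of how coarse (global) and fine (token-level) similarities might be combined into a single retrieval score; the weighting and the token-matching scheme here are illustrative rather than the paper's exact formulation.

```python
# Combine a global image/text similarity with a token-level matching score.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieval_score(img_global, txt_global, img_tokens, txt_tokens, alpha=0.5):
    # Coarse: similarity between whole-image and whole-sentence embeddings.
    coarse = cosine(img_global, txt_global)
    # Fine: each text token is matched to its best image token, then averaged.
    fine = np.mean([max(cosine(t, v) for v in img_tokens) for t in txt_tokens])
    return alpha * coarse + (1 - alpha) * fine

rng = np.random.default_rng(0)
score = retrieval_score(
    rng.normal(size=256), rng.normal(size=256),       # global features
    rng.normal(size=(49, 256)), rng.normal(size=(12, 256)),  # token features
)
```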
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages, see, think and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
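The see-think-confirm loop can be pictured roughly as follows; the captioner, the language-model call, and the grounding check are hypothetical stubs, not IPVR's actual prompts or procedure.

```python
# Schematic see-think-confirm prompting loop with stubbed model calls.

def see(image) -> str:
    """'See': a vision model turns the image into textual evidence. Stubbed."""
    return "a person holding an umbrella on a rainy street"

def think(evidence: str, question: str) -> str:
    """'Think': a language model drafts an answer with a rationale,
    conditioned on the visual evidence. Stubbed."""
    return "umbrella, because it is raining"

def confirm(evidence: str, answer: str) -> bool:
    """'Confirm': check that the rationale is grounded in the evidence;
    if not, the loop re-prompts."""
    return "umbrella" in evidence

def answer_question(image, question: str, max_rounds: int = 3) -> str:
    evidence = see(image)
    draft = ""
    for _ in range(max_rounds):
        draft = think(evidence, question)
        if confirm(evidence, draft):
            return draft
    return draft
```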
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
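A minimal sketch of the retrieve-then-encode idea: images are looked up for a sentence via topic keywords before text and images go to their respective encoders. The lookup table and the keyword matching below are simplified placeholders, not the paper's implementation.

```python
# Toy topic-image lookup: collect images whose topic words occur in a sentence.
from typing import Dict, List

TOPIC_IMAGES: Dict[str, List[str]] = {
    "dog": ["img_dog_01.jpg", "img_dog_02.jpg"],
    "beach": ["img_beach_01.jpg"],
}

def retrieve_images(sentence: str, max_images: int = 3) -> List[str]:
    """Collect a flexible number of images whose topic words appear in the sentence."""
    hits = [img for topic, imgs in TOPIC_IMAGES.items()
            if topic in sentence.lower() for img in imgs]
    return hits[:max_images]

sentence = "A dog is running on the beach."
images = retrieve_images(sentence)
# In the full model, the sentence would go through a Transformer encoder and
# each retrieved image through a CNN, with the two streams fused downstream.
```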
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Multimodal Neural Machine Translation with Search Engine Based Image Retrieval [4.662583832063716]
We propose an open-vocabulary image retrieval method to collect descriptive images for a bilingual parallel corpus.
Our proposed method achieves significant improvements over strong baselines.
arXiv Detail & Related papers (2022-07-26T08:42:06Z)
- Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)
- Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs [106.15931418425906]
We present the first study focused on generating natural language rationales across several complex visual reasoning tasks.
We present RationaleVT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs.
Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
arXiv Detail & Related papers (2020-10-15T05:08:56Z)
- A Novel Attention-based Aggregation Function to Combine Vision and Language [55.7633883960205]
We propose a novel fully-attentive reduction method for vision and language.
Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention.
We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices.
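One way to picture such a fully-attentive reduction is below: each element of one modality is scored against a query from the other modality, and the set collapses to a single vector via the resulting weighted sum. The plain scaled dot-product used here is simpler than the paper's cross-attention variant.

```python
# Reduce a set of feature vectors to one vector with attention-style scores.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_reduce(elements: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Reduce (n, d) elements to a single (d,) vector, weighting each element
    by a score computed against a query from the other modality."""
    scores = softmax(elements @ query / np.sqrt(elements.shape[-1]))
    return scores @ elements

rng = np.random.default_rng(0)
region_feats = rng.normal(size=(36, 128))  # e.g. image region features
text_query = rng.normal(size=128)          # e.g. pooled sentence feature
image_vector = attentive_reduce(region_feats, text_query)
```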
arXiv Detail & Related papers (2020-04-27T18:09:46Z)
- Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
Key components include: (1) part-of-speech tagging to filter out non-semantic words in the given description, (2) an affine combination module to effectively fuse text and image features from different modalities, and (3) a refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators.
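An affine combination module of this kind is often implemented as text-conditioned, channel-wise scale-and-shift modulation of image features (FiLM-style); the sketch below is a guess at the general idea, not the paper's exact module.

```python
# Illustrative text-conditioned affine modulation of an image feature map.
import numpy as np

rng = np.random.default_rng(0)
channels = 64
image_feat = rng.normal(size=(channels, 16, 16))  # C x H x W feature map
text_feat = rng.normal(size=128)                  # sentence embedding

# Two randomly initialised projections stand in for learned linear layers.
W_gamma = rng.normal(size=(channels, 128)) * 0.01
W_beta = rng.normal(size=(channels, 128)) * 0.01
gamma = W_gamma @ text_feat                       # per-channel scale
beta = W_beta @ text_feat                         # per-channel shift

# Modulate the image features with the text-derived affine parameters.
fused = (1 + gamma)[:, None, None] * image_feat + beta[:, None, None]
```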
arXiv Detail & Related papers (2020-02-12T21:09:15Z)