NEUCORE: Neural Concept Reasoning for Composed Image Retrieval
- URL: http://arxiv.org/abs/2310.01358v1
- Date: Mon, 2 Oct 2023 17:21:25 GMT
- Title: NEUCORE: Neural Concept Reasoning for Composed Image Retrieval
- Authors: Shu Zhao, Huijuan Xu
- Abstract summary: We propose a NEUral COncept REasoning model which incorporates multi-modal concept alignment and progressive multi-modal fusion over aligned concepts.
Our proposed approach is evaluated on three datasets and achieves state-of-the-art results.
- Score: 16.08214739525615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Composed image retrieval, which combines a reference image and a text
modifier to identify the desired target image, is a challenging task that requires
the model to comprehend both vision and language modalities and their interactions.
Existing approaches focus on holistic multi-modal interaction modeling, and
ignore the composed and complementary property between the reference image and
text modifier. In order to better utilize the complementarity of multi-modal
inputs for effective information fusion and retrieval, we move the multi-modal
understanding to a finer granularity at the concept level, and learn the multi-modal
concept alignment to identify the visual locations in the reference or target images
corresponding to the text modifier. To this end, we propose a NEUral COncept
REasoning (NEUCORE) model which incorporates multi-modal concept alignment and
progressive multi-modal fusion over aligned concepts. Specifically, considering
that the text modifier may refer to semantic concepts that are absent from the
reference image and must be added to the target image, we learn the multi-modal
concept alignment between the text modifier and the concatenation of the reference
and target images, under a multiple-instance learning framework with image- and
sentence-level weak supervision. Furthermore, based on the aligned concepts, to
form discriminative fusion features of the input modalities for accurate target
image retrieval, we propose a progressive fusion strategy with a unified
execution architecture instantiated by the attended language semantic concepts.
Our proposed approach is evaluated on three datasets and achieves
state-of-the-art results.
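
The abstract's two components can be made concrete with a short sketch. The following is a minimal PyTorch illustration of weakly supervised concept alignment via multiple-instance max-pooling, followed by a concept-by-concept progressive fusion step; the module names, linear projection heads, GRU-cell fusion step, and all tensor shapes are assumptions made for illustration, not the authors' actual NEUCORE architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptAlignmentMIL(nn.Module):
    """Aligns text-modifier concepts to image regions under weak supervision.

    Each concept token is scored against every region of the concatenated
    reference+target images; max-pooling over regions (multiple-instance
    learning) means only image/sentence-level labels are needed.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.txt_proj = nn.Linear(dim, dim)  # hypothetical projection heads
        self.img_proj = nn.Linear(dim, dim)

    def forward(self, concepts: torch.Tensor, regions: torch.Tensor):
        # concepts: (B, T, D) concept embeddings from the text modifier
        # regions:  (B, R, D) region features from reference+target images
        q = F.normalize(self.txt_proj(concepts), dim=-1)
        k = F.normalize(self.img_proj(regions), dim=-1)
        sim = torch.einsum("btd,brd->btr", q, k)   # concept-region similarity
        concept_scores, _ = sim.max(dim=-1)        # MIL bag score per concept
        sentence_score = concept_scores.mean(-1)   # sentence-level bag score
        return concept_scores, sentence_score


class ProgressiveFusion(nn.Module):
    """Fuses the reference-image feature with one attended concept at a time,
    so later fusion steps can build on earlier compositions."""

    def __init__(self, dim: int):
        super().__init__()
        self.step = nn.GRUCell(dim, dim)  # one fusion step (an assumption)

    def forward(self, ref_feat, concepts, concept_scores):
        h = ref_feat                                    # (B, D)
        weights = concept_scores.softmax(dim=-1)        # attend over concepts
        for t in range(concepts.size(1)):
            c = weights[:, t : t + 1] * concepts[:, t]  # weighted concept (B, D)
            h = self.step(c, h)                         # progressive update
        return F.normalize(h, dim=-1)                   # query for retrieval


if __name__ == "__main__":
    B, T, R, D = 2, 5, 36, 256
    align, fuse = ConceptAlignmentMIL(D), ProgressiveFusion(D)
    concepts, regions = torch.randn(B, T, D), torch.randn(B, R, D)
    scores, bag = align(concepts, regions)
    query = fuse(torch.randn(B, D), concepts, scores)  # match against targets
    # Weak supervision: each training sentence describes its own image pair,
    # so the bag label is positive (cosine scores used directly as logits).
    loss = F.binary_cross_entropy_with_logits(bag, torch.ones(B))
    print(query.shape, loss.item())
```

Max-pooling over regions is the standard multiple-instance reading of weak supervision: a concept counts as present if any region supports it, which is exactly what image- and sentence-level labels can certify without region annotations.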
Related papers
- Shapley Value-based Contrastive Alignment for Multimodal Information Extraction [17.04865437165252]
We introduce a new paradigm of Image-Context-Text interaction.
We propose a novel Shapley Value-based Contrastive Alignment (Shap-CA) method.
Our method significantly outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-07-25T08:15:43Z) - Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles.
Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information.
We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z) - Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models [85.14042557052352]
We introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time.
We show that Concept Weaver can generate multiple custom concepts with higher identity fidelity compared to alternative approaches.
arXiv Detail & Related papers (2024-04-05T06:41:27Z) - Textual Localization: Decomposing Multi-concept Images for
Subject-Driven Text-to-Image Generation [5.107886283951882]
We introduce a localized text-to-image model to handle multi-concept input images.
Our method incorporates a novel cross-attention guidance to decompose multiple concepts.
Notably, our method generates cross-attention maps consistent with the target concept in the generated images.
arXiv Detail & Related papers (2024-02-15T14:19:42Z) - M^2ConceptBase: A Fine-Grained Aligned Concept-Centric Multimodal Knowledge Base [61.53959791360333]
We introduce M^2ConceptBase, the first concept-centric multimodal knowledge base (MMKB).
We propose a context-aware approach to align concept-image and concept-description pairs using context information from image-text datasets.
Human studies confirm more than 95% alignment accuracy, underscoring its quality.
arXiv Detail & Related papers (2023-12-16T11:06:11Z) - Cross-Modal Concept Learning and Inference for Vision-Language Models [31.463771883036607]
In existing fine-tuning methods, the class-specific text description is matched against the whole image.
We develop a new method called cross-modal concept learning and inference (CCLI).
Our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts.
arXiv Detail & Related papers (2023-07-28T10:26:28Z) - Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.