Image Search with Text Feedback by Additive Attention Compositional Learning
- URL: http://arxiv.org/abs/2203.03809v1
- Date: Tue, 8 Mar 2022 02:03:49 GMT
- Title: Image Search with Text Feedback by Additive Attention Compositional Learning
- Authors: Yuxin Tian, Shawn Newsam, Kofi Boakye
- Abstract summary: We propose an image-text composition module based on additive attention that can be seamlessly plugged into deep neural networks.
AACL is evaluated on three large-scale datasets (FashionIQ, Fashion200k, and Shopping100k).
- Score: 1.4395184780210915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective image retrieval with text feedback stands to impact a range of
real-world applications, such as e-commerce. Given a source image and text
feedback that describes the desired modifications to that image, the goal is to
retrieve the target images that resemble the source yet satisfy the given
modifications by composing a multi-modal (image-text) query. We propose a novel
solution to this problem, Additive Attention Compositional Learning (AACL),
that uses a multi-modal transformer-based architecture and effectively models
the image-text contexts. Specifically, we propose a novel image-text
composition module based on additive attention that can be seamlessly plugged
into deep neural networks. We also introduce a new challenging benchmark
derived from the Shopping100k dataset. AACL is evaluated on three large-scale
datasets (FashionIQ, Fashion200k, and Shopping100k), each with strong
baselines. Extensive experiments show that AACL achieves new state-of-the-art
results on all three datasets.
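To make the composition step concrete, below is a minimal PyTorch sketch of an additive-attention image-text composition module. This is a sketch of the general idea only, not the authors' exact AACL architecture: the class name, layer names, dimensions, and the mean-pooling of text tokens are all illustrative assumptions.

```python
# Minimal sketch of additive-attention image-text composition (illustrative;
# not the authors' exact AACL module).
import torch
import torch.nn as nn

class AdditiveAttentionComposer(nn.Module):
    """Fuses image-region tokens with a text query via additive attention."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.w_img = nn.Linear(dim, dim)  # projects image-region tokens
        self.w_txt = nn.Linear(dim, dim)  # projects the pooled text query
        self.score = nn.Linear(dim, 1)    # additive-attention scoring vector v
        self.out = nn.Linear(dim, dim)    # final projection of the fused query

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, D) image-region features; txt_tokens: (B, M, D) text features
        txt_query = txt_tokens.mean(dim=1, keepdim=True)  # (B, 1, D), a simple pooling assumption
        # Additive (Bahdanau-style) scores: v^T tanh(W1 x_i + W2 q)
        scores = self.score(torch.tanh(self.w_img(img_tokens) + self.w_txt(txt_query)))  # (B, N, 1)
        attn = scores.softmax(dim=1)                  # attention weights over image tokens
        fused = (attn * img_tokens).sum(dim=1)        # (B, D) text-conditioned image summary
        return self.out(fused)                        # composed multi-modal query embedding

# Usage: composer = AdditiveAttentionComposer(512)
#        query = composer(torch.randn(2, 49, 512), torch.randn(2, 12, 512))  # (2, 512)
```

The composed embedding would then be compared against candidate target-image embeddings, for example by cosine similarity, to rank the retrieval gallery.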
Related papers
- Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval [10.202562518113677]
We propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval.
Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers.
arXiv Detail & Related papers (2024-07-01T05:32:06Z) - TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z) - FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z) - TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z) - Transformer based Multitask Learning for Image Captioning and Object Detection [13.340784876489927]
This work introduces a novel multitask learning framework that combines image captioning and object detection into a joint model.
We propose TICOD, Transformer-based Image Captioning and Object detection model for jointly training both tasks.
Our model outperforms the baselines from the image-captioning literature, achieving a 3.65% improvement in BERTScore.
arXiv Detail & Related papers (2024-03-10T19:31:13Z) - Benchmarking Robustness of Text-Image Composed Retrieval [46.98557472744255]
Text-image composed retrieval aims to retrieve the target image through the composed query.
It has recently attracted attention due to its ability to leverage both information-rich images and concise language.
However, the robustness of these approaches, whether against real-world corruptions or with respect to deeper text understanding, has never been studied (a minimal sketch of the composed-query scoring these methods share appears after this list).
arXiv Detail & Related papers (2023-11-24T20:16:38Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with BOSS, a novel Bottom-up crOss-modal Semantic compoSition framework with Hybrid Counterfactual Training.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
arXiv Detail & Related papers (2021-04-07T09:41:52Z) - SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval [15.074592583852167]
We focus on the task of text-conditioned image retrieval that utilizes support text feedback alongside a reference image to retrieve images.
We propose a novel framework, SAC, which resolves the above in two major steps: "where to see" (Semantic Feature Attention) and "how to change" (Semantic Feature Modification).
We show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-the-art techniques.
arXiv Detail & Related papers (2020-09-03T06:55:23Z) - Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
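Several entries above share the same text-image composed retrieval setup, so the following is a minimal sketch of how a composed query is typically scored against target images. The cosine-similarity logits and batch cross-entropy loss are common choices in this literature, not the method of any single paper listed; function names and the temperature value are illustrative assumptions.

```python
# Minimal sketch of composed-query retrieval scoring with precomputed embeddings.
import torch
import torch.nn.functional as F

def retrieval_logits(composed: torch.Tensor, targets: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Cosine-similarity logits between composed queries and target images."""
    q = F.normalize(composed, dim=-1)  # (B, D) composed image+text query embeddings
    t = F.normalize(targets, dim=-1)   # (B, D) target-image embeddings
    return q @ t.T / tau               # (B, B) temperature-scaled cosine similarities

# Training typically treats each query's own target as the positive class:
# logits = retrieval_logits(composed, targets)
# loss = F.cross_entropy(logits, torch.arange(logits.size(0)))
```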