Person Text-Image Matching via Text-Feature Interpretability Embedding and External Attack Node Implantation
- URL: http://arxiv.org/abs/2211.08657v1
- Date: Wed, 16 Nov 2022 04:15:37 GMT
- Title: Person Text-Image Matching via Text-Feature Interpretability Embedding and External Attack Node Implantation
- Authors: Fan Li, Hang Zhou, Huafeng Li, Yafei Zhang, and Zhengtao Yu
- Abstract summary: Person text-image matching aims to retrieve images of specific pedestrians using text descriptions.
The lack of interpretability of text features makes it challenging to effectively align them with their corresponding image features.
We propose a person text-image matching method by embedding text-feature interpretability and an external attack node.
- Score: 22.070781214170164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Person text-image matching, also known as text-based person search, aims to
retrieve images of specific pedestrians using text descriptions. Although
person text-image matching has made great research progress, existing methods
still face two challenges. First, the lack of interpretability of text features
makes it challenging to effectively align them with their corresponding image
features. Second, the same pedestrian image often corresponds to multiple
different text descriptions, and a single text description can correspond to
multiple different images of the same identity. The diversity of text
descriptions and images makes it difficult for a network to extract robust
features that match the two modalities. To address these problems, we propose a
person text-image matching method by embedding text-feature interpretability
and an external attack node. Specifically, we improve the interpretability of
text features by endowing them with semantic information that is consistent with
the image features, so that text features can be aligned with the image region
features they describe. To address the challenges posed by the diversity of text
and of the corresponding person images, we treat the feature variation caused by
this diversity as perturbation information and propose a novel adversarial attack
and defense method to handle it. In the model design, graph convolution serves as
the basic framework for feature representation, and the adversarial attacks that
text and image diversity exert on feature extraction are simulated by implanting
an additional attack node in the graph convolution layer, which improves the
robustness of the model against text and image diversity. Extensive experiments
demonstrate the effectiveness and superiority of the proposed method for
text-pedestrian image matching over existing methods. The source code of the
method is published at
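As a rough illustration of the attack-node idea, the sketch below implants one extra node into a PyTorch-style graph-convolution layer and connects it to every real node. The class name, layer shapes, and the learnable attack feature are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttackNodeGCNLayer(nn.Module):
    """Graph-convolution layer with one extra 'attack node' implanted.

    The attack node is appended to the node set and connected to every
    real node; its feature vector is a learnable perturbation meant to
    disturb feature extraction (the simulated attack), while the rest
    of the network learns to remain robust to it (the defense).
    """

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        # Hypothetical: the attack node's feature, trained adversarially.
        self.attack_feat = nn.Parameter(torch.zeros(1, in_dim))

    def forward(self, x, adj):
        # x:   (N, in_dim)  node features for one sample
        # adj: (N, N)       normalized adjacency over the real nodes
        n = x.size(0)
        x_aug = torch.cat([x, self.attack_feat], dim=0)       # (N+1, in_dim)
        # Connect the attack node to every real node (and itself).
        adj_aug = torch.ones(n + 1, n + 1, device=x.device)
        adj_aug[:n, :n] = adj
        adj_aug = adj_aug / adj_aug.sum(dim=1, keepdim=True)  # renormalize rows
        h = torch.relu(self.weight(adj_aug @ x_aug))
        return h[:n]  # drop the attack node before the next stage
```

Under this reading, training would alternate between maximizing the matching loss with respect to `attack_feat` (the attack) and minimizing it with respect to the remaining parameters (the defense).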
Related papers
- Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models [68.47333676663312]
We show a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models.
The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens.
We illustrate its benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
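For intuition, here is a hedged sketch of how a two-prompt contrastive term could be combined with a standard epsilon-prediction denoising step; `model`, the embedding arguments, and the scale values are hypothetical, not the paper's exact formulation.

```python
import torch

def contrastive_guidance_eps(model, x_t, t, e_pos, e_neg, e_uncond,
                             cfg_scale=7.5, contrast_scale=3.0):
    """One noise prediction with a contrastive guidance term.

    e_pos / e_neg embed two prompts that differ only in the tokens
    describing the intended factor (e.g. "a photo of a smiling person"
    vs. "a photo of a person"); the difference of their predictions
    points along that factor, on top of classifier-free guidance.
    """
    eps_uncond = model(x_t, t, e_uncond)
    eps_pos = model(x_t, t, e_pos)
    eps_neg = model(x_t, t, e_neg)
    return (eps_uncond
            + cfg_scale * (eps_pos - eps_uncond)        # classifier-free guidance
            + contrast_scale * (eps_pos - eps_neg))     # contrastive direction
```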
arXiv Detail & Related papers (2024-02-21T03:01:17Z) - Seek for Incantations: Towards Accurate Text-to-Image Diffusion
Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
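A minimal sketch of the prompt-learning idea, assuming soft prompt tokens prepended to the encoded text; the class name, token count, and embedding dimension are hypothetical.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Learnable prompt tokens prepended to the user prompt's token
    embeddings; the tokens are optimized so generated images better
    match the input text (a generic prompt-learning sketch)."""

    def __init__(self, n_tokens=8, dim=768):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, text_emb):
        # text_emb: (B, L, dim) token embeddings of the user prompt
        b = text_emb.size(0)
        prompts = self.tokens.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prompts, text_emb], dim=1)
```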
arXiv Detail & Related papers (2024-01-12T03:46:29Z) - DreamInpainter: Text-Guided Subject-Driven Image Inpainting with
Diffusion Models [37.133727797607676]
This study introduces Text-Guided Subject-Driven Image Inpainting.
We compute dense subject features to ensure accurate subject replication.
We employ a discriminative token selection module to eliminate redundant subject details.
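As a generic stand-in for such a token selection module, one could keep only the top-k subject tokens under a learned score; the function below is an assumption-laden sketch, not the paper's module.

```python
import torch

def select_discriminative_tokens(tokens, scores, k=16):
    """Keep the k highest-scoring subject tokens, dropping redundant
    detail. tokens: (B, N, D) dense subject features; scores: (B, N)
    learned per-token relevance scores (both hypothetical inputs)."""
    idx = scores.topk(k, dim=1).indices                    # (B, k)
    return torch.gather(tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
```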
arXiv Detail & Related papers (2023-12-05T22:23:19Z) - Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using
Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z) - Unleashing the Imagination of Text: A Novel Framework for Text-to-image
Person Retrieval via Exploring the Power of Words [0.951828574518325]
We propose a novel framework to explore the power of words in sentences.
The framework employs the pre-trained full CLIP model as a dual encoder for the images and texts.
We introduce a cross-modal triplet loss tailored for handling hard samples, enhancing the model's ability to distinguish minor differences.
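A common batch-hard formulation of a cross-modal triplet loss is sketched below; it is a generic version assuming images and texts at the same batch index share an identity label, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(img_feats, txt_feats, labels, margin=0.2):
    """Triplet loss with batch-hard mining across modalities.

    For each image query, the positive is its least-similar text of the
    same identity and the negative is its most-similar text of another
    identity; the text-to-image direction is handled symmetrically.
    """
    img = F.normalize(img_feats, dim=1)
    txt = F.normalize(txt_feats, dim=1)
    sim = img @ txt.t()                                   # (B, B) cosine sims
    pos_mask = labels.unsqueeze(1).eq(labels.unsqueeze(0))

    def hard_triplet(s):
        hardest_pos = s.masked_fill(~pos_mask, float('inf')).min(dim=1).values
        hardest_neg = s.masked_fill(pos_mask, float('-inf')).max(dim=1).values
        return F.relu(margin - hardest_pos + hardest_neg).mean()

    return hard_triplet(sim) + hard_triplet(sim.t())
```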
arXiv Detail & Related papers (2023-07-18T08:23:46Z) - HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for
Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z) - Learning Semantic-Aligned Feature Representation for Text-based Person
Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
The feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performances.
arXiv Detail & Related papers (2021-12-13T14:54:38Z) - Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image
Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
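A minimal sketch of similarity-driven graph reasoning over object and scene-text nodes; the projections, single graph-convolution step, and all dimensions are chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class MultiModalReasoning(nn.Module):
    """Project object and scene-text features into a common space,
    connect nodes by feature similarity, and run one graph-convolution
    step to obtain relationship-enhanced features."""

    def __init__(self, vis_dim, txt_dim, dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        self.gcn = nn.Linear(dim, dim)

    def forward(self, obj_feats, txt_feats):
        # obj_feats: (No, vis_dim) salient objects; txt_feats: (Nt, txt_dim)
        nodes = torch.cat([self.vis_proj(obj_feats),
                           self.txt_proj(txt_feats)], dim=0)
        adj = torch.softmax(nodes @ nodes.t(), dim=1)  # similarity-based edges
        return torch.relu(self.gcn(adj @ nodes))       # relationship-enhanced
```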
arXiv Detail & Related papers (2020-09-21T12:31:42Z) - DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
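The Matching-Aware Gradient Penalty can be pictured as a gradient penalty evaluated at real-image / matching-text pairs; the discriminator signature and the constants `k` and `p` below are assumptions based on the paper's description, not its released code.

```python
import torch

def matching_aware_gradient_penalty(disc, real_imgs, sent_emb, k=2.0, p=6.0):
    """Penalize the discriminator's gradient at real (image, matching
    text) pairs so its loss surface is smooth around the true data.
    Assumed signature: disc(images, sentence_embeddings) -> (B,) scores.
    """
    imgs = real_imgs.detach().requires_grad_(True)
    embs = sent_emb.detach().requires_grad_(True)
    out = disc(imgs, embs)
    grad_img, grad_emb = torch.autograd.grad(outputs=out.sum(),
                                             inputs=(imgs, embs),
                                             create_graph=True)
    grad_norm = (grad_img.flatten(1).norm(2, dim=1)
                 + grad_emb.flatten(1).norm(2, dim=1))
    return k * grad_norm.pow(p).mean()
```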
arXiv Detail & Related papers (2020-08-13T12:51:17Z) - Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
We propose key components including: (1) part-of-speech tagging to filter out non-semantic words in the given description, (2) an affine combination module to effectively fuse text and image features from different modalities, and (3) a refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators.
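Component (2) can be pictured as text-conditioned per-channel scaling and shifting of visual features; the sketch below is a generic affine modulation under assumed layer sizes, not the paper's exact module.

```python
import torch
import torch.nn as nn

class AffineCombination(nn.Module):
    """Text-conditioned affine modulation of image features: the text
    embedding predicts a per-channel scale and shift applied to the
    visual feature map (layer sizes are illustrative)."""

    def __init__(self, text_dim, channels):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, channels)
        self.to_shift = nn.Linear(text_dim, channels)

    def forward(self, feat, text_emb):
        # feat: (B, C, H, W) image features; text_emb: (B, text_dim)
        scale = self.to_scale(text_emb)[:, :, None, None]
        shift = self.to_shift(text_emb)[:, :, None, None]
        return feat * (1 + scale) + shift
```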
arXiv Detail & Related papers (2020-02-12T21:09:15Z) - STEFANN: Scene Text Editor using Font Adaptive Neural Network [18.79337509555511]
We propose a method to modify text in an image at character-level.
We propose two different neural network architectures - (a) FANnet to achieve structural consistency with source font and (b) Colornet to preserve source color.
Our method works as a unified platform for modifying text in images.
arXiv Detail & Related papers (2019-03-04T11:56:53Z)