A Convolutional Baseline for Person Re-Identification Using Vision and
Language Descriptions
- URL: http://arxiv.org/abs/2003.00808v1
- Date: Thu, 20 Feb 2020 10:12:02 GMT
- Title: A Convolutional Baseline for Person Re-Identification Using Vision and
Language Descriptions
- Authors: Ammarah Farooq, Muhammad Awais, Fei Yan, Josef Kittler, Ali Akbari,
and Syed Safwan Khalid
- Abstract summary: In real-world surveillance scenarios, frequently no visual information will be available about the queried person.
A two-stream deep convolutional neural network framework supervised by cross-entropy loss is presented.
The learnt visual representations are more robust and perform 22% better during retrieval than those of a single-modality system.
- Score: 24.794592610444514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical person re-identification approaches assume that a person of
interest has appeared across different cameras and can be queried by one of the
existing images. However, in real-world surveillance scenarios, frequently no
visual information will be available about the queried person. In such
scenarios, a natural language description of the person by a witness will
provide the only source of information for retrieval. In this work, person
re-identification using both vision and language information is addressed under
all possible gallery and query scenarios. A two-stream deep convolutional
neural network framework supervised by cross-entropy loss is presented. The
weights connecting the second-last layer to the final layer that produces the
class probabilities, i.e., the softmax logits, are shared between the two networks.
Canonical Correlation Analysis is performed to enhance the correlation between
the two modalities in a joint latent embedding space. To investigate the
benefits of the proposed approach, a new testing protocol under a multimodal
ReID setting is proposed for the test split of the CUHK-PEDES and CUHK-SYSU
benchmarks. The experimental results verify the merits of the proposed system.
The learnt visual representations are more robust and perform 22% better
during retrieval than those of a single-modality system. Retrieval with a
multimodal query greatly enhances the re-identification capability of the
system, both quantitatively and qualitatively.
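The CCA step described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumed, classical formulation of CCA applied to two sets of modality embeddings, not the authors' code; the function name, dimensions, and toy data are illustrative.

```python
import numpy as np

def cca_projections(X, Y, k=2, eps=1e-6):
    """Classical CCA: find linear projections of two views (e.g. visual
    and textual embeddings) that maximize their correlation in a joint
    latent space. Returns the two projection matrices and the top-k
    canonical correlations."""
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Cxx = X.T @ X / (n - 1) + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten each view via its Cholesky factor, then SVD the whitened
    # cross-covariance; the singular values are the canonical correlations.
    Lx_inv = np.linalg.inv(np.linalg.cholesky(Cxx))
    Ly_inv = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Lx_inv @ Cxy @ Ly_inv.T)
    Wx = Lx_inv.T @ U[:, :k]   # projection for the first modality
    Wy = Ly_inv.T @ Vt[:k].T   # projection for the second modality
    return Wx, Wy, s[:k]

# Toy usage: two views that share a latent factor correlate strongly
# after projection into the joint space.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))                        # shared latent factor
X = z @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
Y = z @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(200, 4))
Wx, Wy, corr = cca_projections(X, Y)
```

In the paper's setting, X and Y would hold the visual and textual embeddings of the same identities, and retrieval would compare X @ Wx against Y @ Wy in the shared latent space.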
Related papers
- A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning [9.786907179872815]
The potential of vision and language remains underexplored in face forgery detection.
There is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task.
We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap.
arXiv Detail & Related papers (2024-10-01T08:16:40Z)
- ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
- Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models [44.60439935450292]
We propose a novel method for zero-shot visual recognition: RECODE.
It decomposes each predicate category into subject, object, and spatial components.
Different visual cues enhance the discriminability of similar relation categories from different perspectives.
arXiv Detail & Related papers (2023-05-21T14:40:48Z)
- Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address this issue by employing body clues provided by an extra network to distinguish the visible part.
We propose a novel Dynamic Prototype Mask (DPM) based on two self-evident prior knowledge.
Under this condition, the occluded representation could be well aligned in a selected subspace spontaneously.
arXiv Detail & Related papers (2022-07-19T03:31:13Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote Sensing [1.6758573326215689]
Cross-modal text-image retrieval has attracted great attention in remote sensing.
We introduce a novel unsupervised cross-modal contrastive hashing (DUCH) method for text-image retrieval in RS.
Experimental results show that the proposed DUCH outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-04-19T07:25:25Z)
- Global-Local Context Network for Person Search [125.51080862575326]
Person search aims to jointly localize and identify a query person from natural, uncropped images.
We exploit rich context information globally and locally surrounding the target person, which we refer to as scene and group context, respectively.
We propose a unified global-local context network (GLCNet) with the intuitive aim of feature enhancement.
arXiv Detail & Related papers (2021-12-05T07:38:53Z)
- Distribution Alignment: A Unified Framework for Long-tail Visual Recognition [52.36728157779307]
We propose a unified distribution alignment strategy for long-tail visual recognition.
We then introduce a generalized re-weight method in the two-stage learning to balance the class prior.
Our approach achieves the state-of-the-art results across all four recognition tasks with a simple and unified framework.
arXiv Detail & Related papers (2021-03-30T14:09:53Z)
- Gait Recognition using Multi-Scale Partial Representation Transformation with Capsules [22.99694601595627]
We propose a novel deep network, learning to transfer multi-scale partial gait representations using capsules.
Our network first obtains multi-scale partial representations using a state-of-the-art deep partial feature extractor.
It then recurrently learns the correlations and co-occurrences of the patterns among the partial features in forward and backward directions.
arXiv Detail & Related papers (2020-10-18T19:47:38Z)
- Symbiotic Adversarial Learning for Attribute-based Person Search [86.7506832053208]
We present a symbiotic adversarial learning framework, called SAL. Two GANs sit at the base of the framework in a symbiotic learning scheme.
Specifically, two different types of generative adversarial networks learn collaboratively throughout the training process.
arXiv Detail & Related papers (2020-07-19T07:24:45Z)
- FMT: Fusing Multi-task Convolutional Neural Network for Person Search [33.91664470686695]
We propose a fusing multi-task convolutional neural network (FMT-CNN) to tackle the correlation and heterogeneity of detection and re-identification.
Experiment results on CUHK-SYSU Person Search dataset show that the performance of our proposed method is superior to state-of-the-art approaches.
arXiv Detail & Related papers (2020-03-01T05:20:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.