Background Matters Too: A Language-Enhanced Adversarial Framework for Person Re-Identification
- URL: http://arxiv.org/abs/2509.03032v1
- Date: Wed, 03 Sep 2025 05:38:22 GMT
- Title: Background Matters Too: A Language-Enhanced Adversarial Framework for Person Re-Identification
- Authors: Kaicong Huang, Talha Azfar, Jack M. Reilly, Thomas Guggisberg, Ruimin Ke
- Abstract summary: We argue that background semantics are as important as foreground semantics in ReID. This paper proposes an end-to-end framework that jointly models foreground and background information.
- Score: 1.409283414986451
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Person re-identification faces two core challenges: precisely locating the foreground target while suppressing background noise, and extracting fine-grained features from the target region. Numerous visual-only approaches address these issues by partitioning an image and applying attention modules, yet they rely on costly manual annotations and struggle with complex occlusions. Recent multimodal methods, motivated by CLIP, introduce semantic cues to guide visual understanding. However, they focus solely on foreground information and overlook the potential value of background cues. Inspired by human perception, we argue that background semantics are as important as foreground semantics in ReID, as humans tend to eliminate background distractions while focusing on target appearance. Therefore, this paper proposes an end-to-end framework that jointly models foreground and background information within a dual-branch cross-modal feature extraction pipeline. To help the network distinguish between the two domains, we propose an intra-semantic alignment and inter-semantic adversarial learning strategy. Specifically, we align visual and textual features that share the same semantics across domains, while simultaneously penalizing similarity between foreground and background features to enhance the network's discriminative power. This strategy drives the model to actively suppress noisy background regions and enhance attention toward identity-relevant foreground cues. Comprehensive experiments on two holistic and two occluded ReID benchmarks demonstrate the effectiveness and generality of the proposed method, with results that match or surpass those of current state-of-the-art approaches.
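The alignment-plus-adversarial objective above can be pictured with a minimal sketch, assuming cosine-similarity losses and hypothetical (batch, dim) feature tensors for the foreground/background visual and textual branches; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def intra_inter_loss(v_fg, v_bg, t_fg, t_bg, margin=0.0):
    """Sketch of the dual-branch objective:
    - intra-semantic alignment: pull visual/text features of the SAME
      domain (foreground-foreground, background-background) together;
    - inter-semantic adversarial term: penalize similarity between
      foreground and background features.
    All inputs are (batch, dim) tensors; the names are illustrative."""
    v_fg, v_bg = F.normalize(v_fg, dim=1), F.normalize(v_bg, dim=1)
    t_fg, t_bg = F.normalize(t_fg, dim=1), F.normalize(t_bg, dim=1)

    # Intra-semantic alignment: maximize cosine similarity within a domain.
    align = (1 - (v_fg * t_fg).sum(1)).mean() + (1 - (v_bg * t_bg).sum(1)).mean()

    # Inter-semantic adversarial term: push cross-domain similarity below a margin.
    inter = F.relu((v_fg * v_bg).sum(1) - margin).mean() \
          + F.relu((t_fg * t_bg).sum(1) - margin).mean()

    return align + inter

# Usage with random stand-in features:
B, D = 8, 512
loss = intra_inter_loss(torch.randn(B, D), torch.randn(B, D),
                        torch.randn(B, D), torch.randn(B, D))
```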
Related papers
- Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance [36.23578004588688]
We propose Foreground-Aware Slot Attention (FASA), a two-stage framework that separates foreground from background to enable precise object discovery.
In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions.
In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects.
Experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods.
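A rough sketch of the masked slot attention idea (slot 0 reserved for background, the remaining slots competing over foreground pixels); the routing rule and tensor names below are assumptions, not the authors' code:

```python
import torch

def masked_slot_attention(feats, slots, fg_mask):
    """feats: (B, N, D) flattened image features; slots: (B, K, D) slot
    queries; fg_mask: (B, N) binary foreground mask from stage one.
    Slot 0 attends only to background pixels, while slots 1..K-1 compete
    (softmax over slots) for foreground pixels. Illustrative only."""
    B, N, D = feats.shape
    K = slots.shape[1]
    logits = torch.einsum('bkd,bnd->bkn', slots, feats) / D ** 0.5

    # Route pixels: slot 0 sees only background, the rest only foreground.
    neg_inf = torch.finfo(logits.dtype).min
    bg = (fg_mask == 0).unsqueeze(1)                           # (B, 1, N)
    route = torch.cat([~bg, bg.expand(-1, K - 1, -1)], dim=1)  # True = blocked
    logits = logits.masked_fill(route, neg_inf)

    attn = logits.softmax(dim=1)  # slots compete for each pixel
    updates = torch.einsum('bkn,bnd->bkd', attn, feats)
    return updates / (attn.sum(-1, keepdim=True) + 1e-6), attn
```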
arXiv Detail & Related papers (2025-12-02T12:14:05Z)
- Decouple before Align: Visual Disentanglement Enhances Prompt Tuning [85.91474962071452]
Prompt tuning (PT) has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models.
This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality typically conveys more context than the textual modality.
We propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept.
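A speculative sketch of a decouple-before-align step, assuming the visual patch tokens are first split into a text-salient part and a context part, and only the salient part is aligned with the text prompt; the details below are guesses rather than the DAPT recipe:

```python
import torch
import torch.nn.functional as F

def decouple_then_align(patch_tokens, text_feat, keep_ratio=0.5):
    """patch_tokens: (B, N, D); text_feat: (B, D).
    Decouple: keep the patches most similar to the text (salient part),
    treat the rest as context. Align: match the pooled salient part to
    the text feature with a cosine loss. Purely illustrative."""
    sim = torch.einsum('bnd,bd->bn',
                       F.normalize(patch_tokens, dim=-1),
                       F.normalize(text_feat, dim=-1))
    k = max(1, int(patch_tokens.shape[1] * keep_ratio))
    idx = sim.topk(k, dim=1).indices  # indices of text-salient patches
    salient = torch.gather(patch_tokens, 1,
                           idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1]))
    pooled = F.normalize(salient.mean(dim=1), dim=-1)
    return 1 - (pooled * F.normalize(text_feat, dim=-1)).sum(-1).mean()
```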
arXiv Detail & Related papers (2025-08-01T07:46:00Z)
- OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval [59.377821673653436]
Composed Image Retrieval (CIR) is capable of expressing users' intricate retrieval requirements flexibly.
CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation.
This work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping.
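One way to picture the focus-mapping idea (segment the dominant portion, then pool features under that focus map); everything below is an assumed simplification:

```python
import torch

def focus_weighted_features(feat_map, focus_map):
    """feat_map: (B, D, H, W) visual features; focus_map: (B, 1, H, W)
    soft mask for the dominant portion (e.g., from a segmentation head).
    Returns a focus-weighted pooled query feature. Illustrative only."""
    w = focus_map / (focus_map.sum(dim=(2, 3), keepdim=True) + 1e-6)
    return (feat_map * w).sum(dim=(2, 3))  # (B, D)

feat = torch.randn(2, 256, 14, 14)
focus = torch.rand(2, 1, 14, 14)
query_feat = focus_weighted_features(feat, focus)
```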
arXiv Detail & Related papers (2025-07-08T03:27:46Z)
- Attention Disturbance and Dual-Path Constraint Network for Occluded Person Re-identification [36.86516784815214]
We propose a transformer-based Attention Disturbance and Dual-Path Constraint Network (ADP) to enhance the generalization of attention networks.
To imitate real-world obstacles, we introduce an Attention Disturbance Mask (ADM) module that generates offensive noise.
We also develop a Dual-Path Constraint Module (DPC) that obtains preferable supervision information from holistic images.
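To make the attention-disturbance idea concrete, here is a hedged sketch that adds learnable "offensive" noise to a random block of attention logits to imitate an occluder; the shapes and placement are assumptions, not the ADP implementation:

```python
import torch
import torch.nn as nn

class AttentionDisturbanceMask(nn.Module):
    """Sketch: add learnable 'offensive' noise to a random span of the
    attention logits so the model learns to stay robust to occluder-like
    distractions. Not the authors' implementation."""
    def __init__(self, num_tokens):
        super().__init__()
        self.noise = nn.Parameter(torch.zeros(num_tokens))

    def forward(self, attn_logits, block=16):
        # attn_logits: (B, heads, N, N); disturb a random span of key tokens.
        N = attn_logits.shape[-1]
        start = torch.randint(0, max(1, N - block), (1,)).item()
        attn_logits = attn_logits.clone()
        attn_logits[..., start:start + block] += self.noise[start:start + block]
        return attn_logits
```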
arXiv Detail & Related papers (2023-03-20T09:56:35Z)
- Progressively Dual Prior Guided Few-shot Semantic Segmentation [57.37506990980975]
The few-shot semantic segmentation task aims to perform segmentation in query images with only a few annotated support samples.
We propose a progressively dual prior guided few-shot semantic segmentation network.
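As a rough illustration of a "prior" in this line of work, the sketch below builds a query prior map from cosine similarity to a masked support prototype; this is a generic few-shot segmentation construction, not necessarily the paper's dual prior:

```python
import torch
import torch.nn.functional as F

def prior_mask(query_feat, support_feat, support_mask):
    """query_feat/support_feat: (B, D, H, W); support_mask: (B, 1, H, W).
    Builds a coarse prior for the query by cosine similarity to the
    masked support prototype. Generic construction, illustrative only."""
    proto = (support_feat * support_mask).sum(dim=(2, 3)) / \
            (support_mask.sum(dim=(2, 3)) + 1e-6)     # (B, D) prototype
    q = F.normalize(query_feat, dim=1)
    p = F.normalize(proto, dim=1)[..., None, None]    # (B, D, 1, 1)
    return (q * p).sum(dim=1, keepdim=True)           # (B, 1, H, W) prior
```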
arXiv Detail & Related papers (2022-11-20T16:19:47Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
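A hedged sketch of multi-level cross-modal alignment, assuming image and text features from several layers are projected into shared spaces with a per-level cosine loss; the layer choice and loss form are assumptions, not MANet's exact design:

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAlign(nn.Module):
    """Project image/text features from several levels into shared
    spaces and align each level with a cosine loss. Illustrative only."""
    def __init__(self, dims, shared_dim=256):
        super().__init__()
        self.img_proj = nn.ModuleList(nn.Linear(d, shared_dim) for d in dims)
        self.txt_proj = nn.ModuleList(nn.Linear(d, shared_dim) for d in dims)

    def forward(self, img_feats, txt_feats):
        # img_feats/txt_feats: lists of (B, dim_i) features, one per level.
        loss = 0.0
        for pi, pt, fi, ft in zip(self.img_proj, self.txt_proj, img_feats, txt_feats):
            vi, vt = F.normalize(pi(fi), dim=1), F.normalize(pt(ft), dim=1)
            loss = loss + (1 - (vi * vt).sum(1)).mean()
        return loss
```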
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- Semantic Reinforced Attention Learning for Visual Place Recognition [15.84086970453363]
Large-scale visual place recognition (VPR) is inherently challenging because not all visual cues in the image are beneficial to the task.
We propose a novel Semantic Reinforced Attention Learning Network (SRALNet), in which the inferred attention can benefit from both semantic priors and data-driven fine-tuning.
Experiments demonstrate that our method outperforms state-of-the-art techniques on city-scale VPR benchmark datasets.
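A rough sketch of combining a semantic prior map with a learned attention head to re-weight local features for place recognition; the multiplicative fusion rule below is an assumption:

```python
import torch
import torch.nn as nn

class SemanticReinforcedAttention(nn.Module):
    """Fuse a semantic prior map (e.g., stable structures vs. transient
    objects) with a data-driven attention head to weight local features.
    The multiplicative fusion below is illustrative."""
    def __init__(self, dim):
        super().__init__()
        self.attn_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feat_map, semantic_prior):
        # feat_map: (B, D, H, W); semantic_prior: (B, 1, H, W) in [0, 1].
        learned = torch.sigmoid(self.attn_head(feat_map))
        attn = learned * semantic_prior  # reinforce where the prior agrees
        return (feat_map * attn).sum(dim=(2, 3)) / (attn.sum(dim=(2, 3)) + 1e-6)
```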
arXiv Detail & Related papers (2021-08-19T02:14:36Z)
- Dual-path CNN with Max Gated block for Text-Based Person Re-identification [6.1534388046236765]
A novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings.
The framework is based on two deep residual CNNs jointly optimized with cross-modal projection matching.
Our approach achieves a rank-1 score of 55.81% and outperforms the state-of-the-art method by 1.3%.
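One plausible reading of a "max gated" block, with a gate computed from max-pooled sentence context modulating per-word features; this is a speculative reconstruction, not the published layer:

```python
import torch
import torch.nn as nn

class MaxGatedBlock(nn.Module):
    """Gate word-level features with a signal derived from the
    max-pooled sentence context. Speculative reconstruction."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, word_feats):
        # word_feats: (B, T, D) word embeddings from the text CNN path.
        context = word_feats.max(dim=1).values            # (B, D) context
        g = torch.sigmoid(self.gate(context)).unsqueeze(1)
        return word_feats * g                             # gated word features
```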
arXiv Detail & Related papers (2020-09-20T03:33:29Z)
- Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter [62.26677215668959]
We propose a lightweight, weakly supervised deep network to coarsely locate semantically salient regions.
We then fuse multiple off-the-shelf deep models on these semantically salient regions for pixel-wise saliency refinement.
Our method is simple yet effective, and it is the first attempt to treat salient object detection mainly as an object-level semantic re-ranking problem.
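A simple sketch of fusing several off-the-shelf saliency maps inside the coarsely located salient regions; the averaging rule is assumed:

```python
import torch

def fuse_saliency(region_mask, model_outputs):
    """region_mask: (B, 1, H, W) binary mask of semantically salient
    regions from the coarse stage; model_outputs: list of (B, 1, H, W)
    saliency maps from off-the-shelf models. Illustrative fusion."""
    fused = torch.stack(model_outputs, dim=0).mean(dim=0)
    return fused * region_mask  # refine only inside salient regions
```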
arXiv Detail & Related papers (2020-08-10T07:12:43Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-art results on all these settings, demonstrating its efficacy and generalizability.
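A minimal sketch of cross-image co-attention, where an affinity matrix between two images' flattened features lets each image attend to the other's context; a generic construction, not the paper's exact co-attention classifier:

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Cross-image co-attention: compute an affinity matrix between the
    flattened features of two images and let each attend to the other.
    Generic construction for illustration."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.eye(dim))  # learnable bilinear weight

    def forward(self, f1, f2):
        # f1, f2: (B, D, N) flattened feature maps of two related images.
        affinity = torch.einsum('bdn,de,bem->bnm', f1, self.W, f2)  # (B, N, M)
        f1_ctx = torch.einsum('bnm,bdm->bdn', affinity.softmax(-1), f2)
        f2_ctx = torch.einsum('bnm,bdn->bdm', affinity.softmax(1), f1)
        return f1_ctx, f2_ctx
```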
arXiv Detail & Related papers (2020-07-03T21:53:46Z)