Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search
- URL: http://arxiv.org/abs/2502.03230v1
- Date: Wed, 05 Feb 2025 14:45:09 GMT
- Title: Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search
- Authors: Jiayi He, Shengeng Tang, Ao Liu, Lechao Cheng, Jingjing Wu, Yanyan Wei
- Abstract summary: This paper presents the HFUT-LMC team's solution to the WWW 2025 challenge on Text-based Person Anomaly Search (TPAS).
The primary objective of this challenge is to accurately identify pedestrians exhibiting either normal or abnormal behavior within a large library of pedestrian images.
We introduce the Similarity Coverage Analysis (SCA) strategy to address the recognition difficulty caused by similar text descriptions.
- Score: 20.695290280579858
- Abstract: This paper presents the HFUT-LMC team's solution to the WWW 2025 challenge on Text-based Person Anomaly Search (TPAS). The primary objective of this challenge is to accurately identify pedestrians exhibiting either normal or abnormal behavior within a large library of pedestrian images. Unlike traditional video analysis tasks, TPAS significantly emphasizes understanding and interpreting the subtle relationships between text descriptions and visual data. The complexity of this task lies in the model's need to not only match individuals to text descriptions in massive image datasets but also accurately differentiate between search results when faced with similar descriptions. To overcome these challenges, we introduce the Similarity Coverage Analysis (SCA) strategy to address the recognition difficulty caused by similar text descriptions. This strategy effectively enhances the model's capacity to manage subtle differences, thus improving both the accuracy and reliability of the search. Our proposed solution demonstrated excellent performance in this challenge.
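The abstract does not spell out the mechanics of SCA. As a rough, assumption-heavy sketch only: suppose SCA-style disambiguation means detecting near-duplicate text queries whose top-ranked gallery images collide and re-assigning the lower-scoring query to its next-best candidate. The function name, the duplicate threshold, and the tie-break rule below are all hypothetical.

```python
import torch

def similarity_coverage_rerank(txt_emb, img_emb, dup_thresh=0.95):
    """Illustrative re-ranking for near-duplicate text queries (not the paper's exact SCA).

    txt_emb: (Q, D) L2-normalized text-query embeddings.
    img_emb: (G, D) L2-normalized gallery-image embeddings.
    Returns one gallery index per query.
    """
    sim = txt_emb @ img_emb.T                    # (Q, G) query-gallery cosine similarity
    best = sim.argmax(dim=1)                     # provisional top-1 image per query
    txt_sim = txt_emb @ txt_emb.T                # query-query similarity
    ranked = sim.argsort(dim=1, descending=True)

    # When two near-duplicate queries claim the same image, keep it for the
    # query with the higher score and move the other to its next-best image.
    for i in range(txt_emb.size(0)):
        for j in range(i + 1, txt_emb.size(0)):
            if txt_sim[i, j] >= dup_thresh and best[i] == best[j]:
                loser = i if sim[i, best[i]] < sim[j, best[j]] else j
                contested = best[i].item()
                for g in ranked[loser]:
                    if g.item() != contested:
                        best[loser] = g
                        break
    return best
```

In practice, a threshold like dup_thresh would be tuned on validation pairs of similar captions.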
Related papers
- Improving Text-based Person Search via Part-level Cross-modal Correspondence [29.301950609839796]
We introduce an efficient encoder-decoder model that extracts coarse-to-fine embedding vectors.
A further challenge is learning to capture fine-grained information with only person IDs as supervision.
We propose a novel ranking loss, dubbed the commonality-based margin ranking loss, which quantifies the degree of commonality of each body part (a loss sketch follows this entry).
arXiv Detail & Related papers (2024-12-31T07:29:50Z)
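A minimal sketch of a commonality-weighted margin ranking loss, assuming the commonality score scales the margin per body part; the summary above does not give the exact formulation, so the weighting rule and tensor shapes here are assumptions.

```python
import torch.nn.functional as F

def commonality_margin_ranking_loss(pos_sim, neg_sim, commonality, base_margin=0.2):
    """pos_sim, neg_sim: (B, P) part-wise similarities for matched / unmatched pairs.
    commonality: (B, P) in [0, 1], how consistently each body part is visible.
    Reliably visible (common) parts get a larger margin; rare parts a smaller one."""
    margin = base_margin * commonality           # per-part margin (assumed form)
    return F.relu(margin - pos_sim + neg_sim).mean()
```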
- Enhancing Visual Representation for Text-based Person Searching [9.601697802095119]
VFE-TPS is a Visual Feature Enhanced Text-based Person Search model.
It introduces a pre-trained CLIP backbone to learn basic multimodal features.
It constructs a Text Guided Masked Image Modeling task to enhance the model's ability to learn local visual details (a masking sketch follows this entry).
arXiv Detail & Related papers (2024-12-30T01:38:14Z)
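One plausible reading of text-guided masking, sketched under the assumption that the patches most relevant to the caption are the ones masked and later reconstructed; the selection rule and shapes are assumptions, not the paper's verified design.

```python
import torch

def text_guided_mask(patch_emb, text_emb, mask_ratio=0.4):
    """patch_emb: (B, N, D) visual patch tokens; text_emb: (B, D) pooled text feature.
    Masks the patches most relevant to the text so a decoder must recover
    the text-described local details."""
    rel = torch.einsum('bnd,bd->bn', patch_emb, text_emb)   # text-patch relevance
    k = int(patch_emb.size(1) * mask_ratio)
    idx = rel.topk(k, dim=1).indices                        # most text-relevant patches
    mask = torch.zeros(patch_emb.shape[:2], dtype=torch.bool, device=patch_emb.device)
    mask.scatter_(1, idx, True)
    return mask  # True = masked; reconstruct these tokens during training
```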
- Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search [25.907668574771705]
We propose a new task, text-based person anomaly search, which locates pedestrians engaged in either routine or anomalous activities via text.
To enable the training and evaluation of this new task, we construct a large-scale image-text Pedestrian Anomaly Behavior benchmark.
We introduce a cross-modal pose-aware framework that integrates human pose patterns with identity-based hard negative pair sampling (a sampling sketch follows this entry).
arXiv Detail & Related papers (2024-11-26T09:50:15Z)
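A minimal sketch of identity-based hard negative pair sampling, assuming a hypothetical sample schema with 'id' and 'caption' fields: negatives share the person's identity but describe different behavior, forcing the model to attend to actions rather than appearance alone.

```python
import random
from collections import defaultdict

def identity_hard_negatives(samples):
    """samples: list of dicts with 'id', 'image', 'caption' keys (hypothetical schema).
    For each sample, pick a negative of the SAME person with a DIFFERENT caption,
    so identity cues cannot separate the positive from the negative."""
    by_id = defaultdict(list)
    for s in samples:
        by_id[s['id']].append(s)
    pairs = []
    for s in samples:
        candidates = [o for o in by_id[s['id']] if o['caption'] != s['caption']]
        if candidates:
            pairs.append((s, random.choice(candidates)))  # (anchor, hard negative)
    return pairs
```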
- Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing [71.29488677105127]
Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially artistic and severely distorted characters.
We propose a contrastive learning-based STR framework that leverages synthetic and real unlabeled data without any human annotation cost (a generic contrastive objective is sketched after this entry).
Our method achieves state-of-the-art performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark, respectively).
arXiv Detail & Related papers (2024-11-23T15:24:47Z)
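The entry above names a contrastive framework without specifying its loss; as a stand-in, here is a standard InfoNCE objective between two augmented views of the same unlabeled text images, a common choice in such frameworks rather than the paper's confirmed formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """z1, z2: (B, D) projected features of two augmented views of the same batch.
    Each sample's positive is its own counterpart; all other samples are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature             # (B, B) view-to-view similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)      # positives lie on the diagonal
```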
- Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval [66.61856014573742]
Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description.
Previous methods have attempted to align text and image samples in a modal-shared space.
We propose an effective bi-directional one-to-many embedding paradigm that offers a clear optimization direction for each sample.
arXiv Detail & Related papers (2024-06-09T03:06:55Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- PRIME: Prioritizing Interpretability in Failure Mode Extraction [49.93565079216376]
We study the challenge of providing human-understandable descriptions for failure modes in trained image classification models.
We propose a novel approach that prioritizes interpretability in this problem.
Our method successfully identifies failure modes and generates high-quality text descriptions associated with them.
arXiv Detail & Related papers (2023-09-29T22:00:12Z)
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval involves finding images related to a given query text, or vice versa.
Many recent methods have proposed effective solutions to the image-text matching problem, mostly using large vision-language (VL) Transformer networks.
This paper proposes an ALign And DIstill Network (ALADIN) to bridge the gap between effectiveness and efficiency (a distillation sketch follows this entry).
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
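A minimal sketch of distilling fine-grained alignment scores into efficient global embeddings, assuming a KL objective that matches the student's in-batch dot-product score distribution to the teacher's fine-grained scores; the loss form and shapes are assumptions, not ALADIN's published design.

```python
import torch.nn.functional as F

def score_distillation_loss(img_glob, txt_glob, teacher_scores, temperature=1.0):
    """img_glob, txt_glob: (B, D) global embeddings from the efficient student.
    teacher_scores: (B, B) alignment scores from the slow fine-grained model.
    Aligns the student's in-batch score distribution with the teacher's."""
    student = F.normalize(img_glob, dim=1) @ F.normalize(txt_glob, dim=1).T
    s_logp = F.log_softmax(student / temperature, dim=1)
    t_prob = F.softmax(teacher_scores / temperature, dim=1)
    return F.kl_div(s_logp, t_prob, reduction='batchmean')
```

At inference, only the student's cheap dot products are needed, which is where the efficiency gain would come from.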
- Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge [60.616313552585645]
We present models for effective Ambiguity Detection and Coreference Resolution in Conversational AI.
Specifically, we use TOD-BERT and LXMERT based models, compare them to a number of baselines and provide ablation experiments.
Our results show that (1) language models are able to exploit correlations in the data to detect ambiguity; and (2) unimodal coreference resolution models can avoid the need for a vision component.
arXiv Detail & Related papers (2022-02-25T12:10:02Z)
- Learning Semantic-Aligned Feature Representation for Text-based Person Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
The feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-13T14:54:38Z)
- Text-based Person Search in Full Images via Semantic-Driven Proposal Generation [42.25611020956918]
We propose a new end-to-end learning framework that jointly optimizes the pedestrian detection, identification, and visual-semantic feature embedding tasks.
To take full advantage of the query text, the semantic features are leveraged to instruct the Region Proposal Network to pay more attention to text-described proposals (a re-scoring sketch follows this entry).
arXiv Detail & Related papers (2021-09-27T11:42:40Z)
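A minimal sketch of text-guided proposal re-scoring, assuming the query embedding simply boosts the objectness of proposals whose pooled features match the text; the additive fusion rule and the alpha weight are assumptions, not the paper's confirmed mechanism.

```python
import torch
import torch.nn.functional as F

def text_guided_objectness(objectness, proposal_feats, text_emb, alpha=0.5):
    """objectness: (N,) raw RPN objectness logits.
    proposal_feats: (N, D) pooled proposal features; text_emb: (D,) query embedding.
    Proposals that look like the described person receive a score boost."""
    feats = F.normalize(proposal_feats, dim=1)
    text = F.normalize(text_emb, dim=0)
    text_score = feats @ text                    # (N,) text-proposal similarity
    return objectness + alpha * text_score       # boost text-described proposals
```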
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.