RefCrowd: Grounding the Target in Crowd with Referring Expressions
- URL: http://arxiv.org/abs/2206.08172v1
- Date: Thu, 16 Jun 2022 13:39:26 GMT
- Title: RefCrowd: Grounding the Target in Crowd with Referring Expressions
- Authors: Heqian Qiu, Hongliang Li, Taijin Zhao, Lanxiao Wang, Qingbo Wu and
Fanman Meng
- Abstract summary: We propose RefCrowd, a dataset aimed at grounding the target person in a crowd with referring expressions.
The task requires not only sufficiently mining the natural language information, but also carefully focusing on subtle differences between the target and a crowd of persons with similar appearance.
We also propose a Fine-grained Multi-modal Attribute Contrastive Network (FMAC) to handle REF in crowd understanding.
- Score: 20.822504213866726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Crowd understanding has aroused widespread interest in the vision domain due
to its important practical significance. Unfortunately, there has been no effort to
explore crowd understanding in the multi-modal domain that bridges natural language
and computer vision. Referring expression comprehension (REF) is such a
representative multi-modal task. Current REF studies focus more on grounding
the target object among multiple distinctive categories in general scenarios, which
makes them difficult to apply to complex real-world crowd understanding. To fill this
gap, we propose a new and challenging dataset, called RefCrowd, which is aimed at
looking for the target person in a crowd with referring expressions. It not only
requires sufficiently mining the natural language information, but also
carefully focusing on subtle differences between the target and a
crowd of persons with similar appearance, so as to realize the fine-grained
mapping from language to vision. Furthermore, we propose a Fine-grained
Multi-modal Attribute Contrastive Network (FMAC) to deal with REF in crowd
understanding. It first decomposes the intricate visual and language features
into attribute-aware multi-modal features, and then captures discriminative yet
robust fine-grained attribute features to effectively distinguish the
subtle differences between similar persons. The proposed method outperforms
existing state-of-the-art (SoTA) methods on our RefCrowd dataset and existing
REF datasets. In addition, we implement an end-to-end REF toolbox to support
deeper research in the multi-modal domain. Our dataset and code are available
at: \url{https://qiuheqian.github.io/datasets/refcrowd/}.
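To make the attribute-contrastive idea concrete, below is a minimal sketch of an attribute-aware contrastive objective in the spirit of the abstract: candidate persons and the expression are both decomposed into per-attribute features, and the referred person is treated as the positive against visually similar distractors. All names, shapes, and the temperature value are illustrative assumptions, not the paper's actual API or loss.

```python
# Hypothetical sketch (not the official FMAC implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeContrastiveLoss(nn.Module):
    """Pulls the referred person's attribute features toward the expression's
    attribute features and pushes similar distractor persons away."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, person_attr: torch.Tensor, lang_attr: torch.Tensor,
                target_idx: torch.Tensor) -> torch.Tensor:
        # person_attr: (num_persons, num_attributes, dim) attribute-decomposed visual features
        # lang_attr:   (num_attributes, dim)              attribute-decomposed language features
        # target_idx:  scalar LongTensor, index of the referred person
        person_attr = F.normalize(person_attr, dim=-1)
        lang_attr = F.normalize(lang_attr, dim=-1)

        # Per-attribute cosine similarity of every person to the expression,
        # averaged over attributes: (num_persons,)
        sim = torch.einsum("pad,ad->pa", person_attr, lang_attr)
        logits = sim.mean(dim=1) / self.temperature

        # InfoNCE-style loss: the referred person is the positive,
        # all other persons in the crowd are negatives.
        return F.cross_entropy(logits.unsqueeze(0), target_idx.view(1))


if __name__ == "__main__":
    loss_fn = AttributeContrastiveLoss()
    persons = torch.randn(8, 4, 256)   # 8 candidate persons, 4 attributes, 256-d features
    language = torch.randn(4, 256)     # expression decomposed into the same 4 attributes
    target = torch.tensor(2)           # ground-truth person index
    print(loss_fn(persons, language, target).item())
```

The per-attribute decomposition is what allows the loss to key on fine-grained cues (e.g. clothing or position) rather than whole-body appearance, which is the motivation the abstract gives for distinguishing similar persons in a crowd.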
Related papers
- FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
- More Pictures Say More: Visual Intersection Network for Open Set Object Detection [4.206612461069489]
We introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO).
VINO constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps.
Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands.
arXiv Detail & Related papers (2024-08-26T05:52:35Z)
- Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation [53.97962603641629]
We propose a novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme, named TEAM.
TEAM extracts the object-level semantic meta-data instead of the traditional global visual features from the input image.
TEAM introduces a multi-source semantic graph that comprehensively characterizes the multi-source semantic relations.
arXiv Detail & Related papers (2023-06-29T03:26:10Z)
- Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image according to a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
- OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network [17.980765138522322]
This work introduces OmDet, a novel language-aware object detection architecture.
Leveraging natural language as a universal knowledge representation, OmDet accumulates a "visual vocabulary" from diverse datasets.
We demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding.
arXiv Detail & Related papers (2022-09-10T14:25:14Z)
- Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension [21.000045864213327]
Referring expression comprehension (REC) generally requires a large amount of multi-grained information from the visual and linguistic modalities to realize accurate reasoning.
How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task.
We propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability.
arXiv Detail & Related papers (2022-04-21T08:32:47Z)
- AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ has significantly improved the accuracy on three widely used aerial benchmarks, while being as fast as the mainstream method.
arXiv Detail & Related papers (2022-02-18T10:14:45Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID [20.700750237972155]
Cross-modal person re-identification (Re-ID) is critical for modern video surveillance systems.
The key challenge is to align inter-modality representations according to the semantic information present for a person while ignoring background information.
We present AXM-Net, a novel CNN-based architecture designed for learning semantically aligned visual and textual representations.
arXiv Detail & Related papers (2021-01-19T16:06:39Z)