Referring to Any Person
- URL: http://arxiv.org/abs/2503.08507v1
- Date: Tue, 11 Mar 2025 14:57:14 GMT
- Title: Referring to Any Person
- Authors: Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Qin Liu, Lei Zhang,
- Abstract summary: Existing models fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring. We introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek.
- Score: 15.488874769107092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at https://github.com/IDEA-Research/RexSeek
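Because HumanRef targets one-to-many referring, where a single description can match several individuals, a prediction has to be scored against a set of ground-truth boxes rather than a single one. The sketch below illustrates one plausible way to do this with greedy IoU matching; the matching scheme, threshold, and function names are assumptions for illustration, not the official HumanRef protocol or the RexSeek API.

```python
# Hedged sketch of scoring one-to-many person referring: a single expression may
# refer to several people, so predicted boxes are greedily matched to ground-truth
# boxes by IoU. Illustrative only -- not the official HumanRef evaluation protocol.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def score_expression(preds: List[Box], gts: List[Box], thr: float = 0.5):
    """Greedily match predictions to ground truth; return (precision, recall)."""
    unmatched = list(range(len(gts)))
    tp = 0
    for p in preds:
        best_j, best_v = -1, thr
        for j in unmatched:
            v = iou(p, gts[j])
            if v >= best_v:
                best_j, best_v = j, v
        if best_j >= 0:
            unmatched.remove(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```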
Related papers
- IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes [10.139461308573336]
IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms.
We aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems.
arXiv Detail & Related papers (2025-03-20T16:16:10Z)
- RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios [60.772871735598706]
RefHCM (Referring Human-Centric Model) is a framework that integrates a wide range of human-centric referring tasks. RefHCM employs sequence mergers to convert raw multimodal data -- including images, text, coordinates, and parsing maps -- into semantic tokens. This work represents the first attempt to address referring human perceptions with a general-purpose framework.
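As a rough illustration of how coordinates can be converted into tokens, the snippet below quantizes a bounding box onto a discrete grid and emits location tokens, in the spirit of Pix2Seq/OFA-style serialization; the function name, token format, and 1000-bin grid are assumptions for illustration, not RefHCM's actual sequence-merger design.

```python
# Hedged sketch: serialize box coordinates into discrete "location tokens" so they
# can sit in the same sequence as text tokens. The token format and bin count are
# illustrative assumptions, not RefHCM's actual design.
def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """Quantize (x1, y1, x2, y2) into <loc_i> tokens on a num_bins grid."""
    x1, y1, x2, y2 = box
    normalized = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(num_bins - 1, int(v * num_bins)) for v in normalized]
    return [f"<loc_{b}>" for b in bins]

print(box_to_tokens((48, 120, 210, 460), image_w=640, image_h=480))
# -> ['<loc_75>', '<loc_250>', '<loc_328>', '<loc_958>']
```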
arXiv Detail & Related papers (2024-12-19T08:51:57Z)
- Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics [37.86612817818566]
General-purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics. We introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks with a total of 25 datasets. Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks.
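A minimal sketch of using CLIP as a zero-shot perceptual similarity metric is shown below; it relies on the public Hugging Face `openai/clip-vit-base-patch32` checkpoint and cosine similarity between image embeddings, which is a common setup but not necessarily the exact UniSim-Bench harness.

```python
# Minimal sketch: CLIP image embeddings as a zero-shot perceptual similarity score.
# Uses a public Hugging Face checkpoint; the UniSim-Bench harness may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between CLIP embeddings of two images."""
    images = [Image.open(path_a).convert("RGB"), Image.open(path_b).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```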
arXiv Detail & Related papers (2024-12-13T22:38:09Z)
- HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks [25.959032350818795]
We present HumanEval-V, a benchmark of human-annotated coding tasks.
Each task features carefully crafted diagrams paired with function signatures and test cases.
We find that even top-performing models achieve modest success rates.
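Success rates on HumanEval-style coding benchmarks are typically reported as pass@k; the unbiased estimator below follows the original HumanEval/Codex formulation, though whether HumanEval-V reports exactly this variant is an assumption here.

```python
# Unbiased pass@k estimator commonly used for HumanEval-style benchmarks:
# given n generated samples per task of which c pass the tests,
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset contains at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))  # 0.15
```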
arXiv Detail & Related papers (2024-10-16T09:04:57Z)
- Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
- Hulk: A Universal Knowledge Translator for Human-Centric Tasks [69.8518392427151]
We present Hulk, the first multimodal human-centric generalist model.
It addresses 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning.
Hulk achieves state-of-the-art performance in 11 benchmarks.
arXiv Detail & Related papers (2023-12-04T07:36:04Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- Recovering 3D Human Mesh from Monocular Images: A Survey [49.00136388529404]
Estimating human pose and shape from monocular images is a long-standing problem in computer vision.
This survey focuses on the task of monocular 3D human mesh recovery.
arXiv Detail & Related papers (2022-03-03T18:56:08Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind their benign accuracy.
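A minimal sketch of the benign-vs-adversarial comparison this implies is given below; `predict` stands in for any text classifier, and the example format is an assumption, not AdvGLUE's actual schema.

```python
# Illustrative robustness-gap computation in the spirit of AdvGLUE: score the same
# classifier on benign examples and their adversarial counterparts, then report the
# drop. `predict` and the example fields are placeholders, not AdvGLUE's schema.
def accuracy(predict, examples):
    correct = sum(predict(ex["text"]) == ex["label"] for ex in examples)
    return correct / len(examples)

def robustness_gap(predict, benign, adversarial):
    benign_acc = accuracy(predict, benign)
    adv_acc = accuracy(predict, adversarial)
    return benign_acc, adv_acc, benign_acc - adv_acc
```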
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments, and that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task.
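A small sketch of what reporting per-reference variance could look like is below: score a hypothesis against each human reference separately and report mean and standard deviation; `score` is a placeholder for any sentence-level metric rather than a specific choice from the paper.

```python
# Hedged sketch: expose per-reference variance by scoring a hypothesis against each
# human reference separately instead of reporting one aggregated number.
# `score(hypothesis, reference)` is a placeholder for any sentence-level metric.
from statistics import mean, stdev

def per_reference_stats(score, hypothesis, references):
    values = [score(hypothesis, ref) for ref in references]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```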
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
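As a hedged sketch of what a word-order contrastive objective could look like, the snippet below penalizes the model whenever a shuffled expression scores as high as the original; `score_fn` is a hypothetical stand-in for a ViLBERT-style image-text matching score, and the margin value is an assumption rather than the paper's method.

```python
# Hedged sketch of a word-order contrastive objective: the grounding score of the
# original expression should exceed that of a word-shuffled version by a margin.
# `score_fn(image, text)` is a hypothetical ViLBERT-style matching score.
import random
import torch

def shuffle_words(expression: str) -> str:
    words = expression.split()
    random.shuffle(words)
    return " ".join(words)

def order_contrastive_loss(score_fn, image, expression, margin: float = 0.2):
    pos = score_fn(image, expression)                # scalar tensor
    neg = score_fn(image, shuffle_words(expression)) # scalar tensor
    return torch.relu(margin - (pos - neg))
```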
arXiv Detail & Related papers (2020-05-04T17:09:15Z)