Portrait Interpretation and a Benchmark
- URL: http://arxiv.org/abs/2207.13315v1
- Date: Wed, 27 Jul 2022 06:25:09 GMT
- Title: Portrait Interpretation and a Benchmark
- Authors: Yixuan Fan, Zhaopeng Dou, Yali Li, Shengjin Wang
- Abstract summary: The proposed portrait interpretation task approaches human perception from a new, systematic perspective.
We construct a new dataset that contains 250,000 images labeled with identity, gender, age, physique, height, expression, and posture of the whole body and arms.
Our experimental results demonstrate that combining the tasks related to portrait interpretation can yield benefits.
- Score: 49.484161789329804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a task we name Portrait Interpretation and construct a dataset
named Portrait250K for it. Current research on portraits, such as human
attribute recognition and person re-identification, has achieved many
successes, but in general it: 1) overlooks the interrelationships between
tasks and the benefits that mining them may bring; 2) designs deep models
separately for each task, which is inefficient; 3) may fail to meet the need
for a unified model and comprehensive perception in real-world scenes. In
this paper, the proposed portrait interpretation task approaches human
perception from a new, systematic perspective. We divide the
perception of portraits into three aspects, namely Appearance, Posture, and
Emotion, and design corresponding sub-tasks for each aspect. Based on the
framework of multi-task learning, portrait interpretation requires a
comprehensive description of static attributes and dynamic states of portraits.
To invigorate research on this new task, we construct a new dataset that
contains 250,000 images labeled with identity, gender, age, physique, height,
expression, and posture of the whole body and arms. Our dataset is collected
from 51 movies, hence covering extensive diversity. Furthermore, we focus on
representation learning for portrait interpretation and propose a baseline that
reflects our systematic perspective. We also propose an appropriate metric for
this task. Our experimental results demonstrate that combining the tasks
related to portrait interpretation can yield benefits. Code and dataset will be
made public.
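The multi-task framing described in the abstract (a shared representation feeding separate sub-task heads for Appearance, Posture, and Emotion, trained with a combined loss) can be sketched as follows. This is a minimal illustration only, not the authors' actual baseline: the layer sizes, head output dimensions, and uniform loss weighting are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared backbone: a single linear layer standing in for a real feature extractor.
D_IN, D_FEAT = 512, 128
W_shared = rng.normal(0, 0.01, (D_IN, D_FEAT))

# One linear head per sub-task; the output sizes here are illustrative only.
TASKS = {"appearance": 40, "posture": 17, "emotion": 7}
heads = {t: rng.normal(0, 0.01, (D_FEAT, k)) for t, k in TASKS.items()}

def forward(x):
    """Map an input batch to per-task logits through the shared features."""
    feat = np.maximum(x @ W_shared, 0.0)  # ReLU on the shared features
    return {t: feat @ W for t, W in heads.items()}

def multitask_loss(logits, labels, weights=None):
    """Sum of per-task cross-entropy losses (uniform weights by default)."""
    total = 0.0
    for t, z in logits.items():
        z = z - z.max(axis=1, keepdims=True)  # numerically stable softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        ce = -logp[np.arange(len(labels[t])), labels[t]].mean()
        total += (weights or {}).get(t, 1.0) * ce
    return total

x = rng.normal(size=(4, D_IN))
logits = forward(x)
labels = {t: rng.integers(0, k, size=4) for t, k in TASKS.items()}
loss = multitask_loss(logits, labels)
```

The point of the shared backbone is that gradients from all three aspects update the same features, which is where the claimed benefit of combining the tasks would come from in such a setup.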
Related papers
- Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
- Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image Interpretation [19.987706084203523]
We propose Panoptic Perception, a novel task and a new fine-grained dataset (FineGrip) to achieve a more thorough and universal interpretation for RSIs.
The new task integrates pixel-level, instance-level, and image-level information for universal image perception.
The FineGrip dataset includes 2,649 remote sensing images, 12,054 fine-grained instance segmentation masks belonging to 20 foreground things categories, 7,599 background semantic masks for 5 stuff classes, and 13,245 captioning sentences.
arXiv Detail & Related papers (2024-04-06T12:27:21Z)
- Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) of personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and comes with rich semantic ground-truth annotations.
We derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and that sets a new SoTA both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Enhancing the Authenticity of Rendered Portraits with Identity-Consistent Transfer Learning [30.64677966402945]
We present a novel photo-realistic portrait generation framework that can effectively mitigate the "uncanny valley" effect.
Our key idea is to employ transfer learning to learn an identity-consistent mapping from the latent space of rendered portraits to that of real portraits.
arXiv Detail & Related papers (2023-10-06T12:20:40Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark [80.79082788458602]
We provide a new multi-task benchmark for evaluating text-to-image models.
We compare the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models.
Twenty computer science AI graduate students evaluated the two models, on three tasks, at three difficulty levels, across ten prompts each.
arXiv Detail & Related papers (2022-11-22T09:27:53Z)
- FixMyPose: Pose Correctional Captioning and Retrieval [67.20888060019028]
We introduce a new captioning dataset named FixMyPose to address automated pose correction systems.
We collect descriptions of correcting a "current" pose to look like a "target" pose.
To avoid ML biases, we maintain a balance across characters with diverse demographics.
arXiv Detail & Related papers (2021-04-04T21:45:44Z)
- Visual Relationship Detection using Scene Graphs: A Survey [1.3505077405741583]
A Scene Graph is a technique to better represent a scene and the various relationships present in it.
We present a detailed survey of the various techniques for scene graph generation, their efficacy in representing visual relationships, and how scene graphs have been used to solve various downstream tasks.
arXiv Detail & Related papers (2020-05-16T17:06:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all of its content) and is not responsible for any consequences of its use.