HumanVLM: Foundation for Human-Scene Vision-Language Model
- URL: http://arxiv.org/abs/2411.03034v1
- Date: Tue, 05 Nov 2024 12:14:57 GMT
- Title: HumanVLM: Foundation for Human-Scene Vision-Language Model
- Authors: Dawei Dai, Xu Long, Li Yutang, Zhang Yuanhui, Shuyin Xia
- Abstract summary: Human-scene vision-language tasks are increasingly prevalent in diverse social applications.
This study introduces a domain-specific Large Vision-Language Model, Human-Scene Vision-Language Model (HumanVLM)
In experiments, we evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance.
- Abstract: Human-scene vision-language tasks are increasingly prevalent in diverse social applications, yet recent advancements predominantly rely on models specifically tailored to individual tasks. Emerging research indicates that large vision-language models (VLMs) can enhance performance across various downstream vision-language understanding tasks. However, general-domain models often underperform in specialized fields. This study introduces a domain-specific large vision-language model, the Human-Scene Vision-Language Model (HumanVLM), designed to provide a foundation for human-scene vision-language tasks. Specifically, (1) we create a large-scale human-scene multimodal image-text dataset (HumanCaption-10M) sourced from the Internet to facilitate domain-specific alignment; (2) we develop a captioning approach for human-centered images that covers human faces, bodies, and backgrounds, and construct a high-quality human-scene image-text dataset (HumanCaptionHQ, about 311k pairs) containing as much detailed information about humans as possible; (3) using HumanCaption-10M and HumanCaptionHQ, we train HumanVLM. In experiments, we evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance among multimodal models of comparable scale, particularly excelling in human-related tasks and significantly outperforming similar models, including Qwen2VL and ChatGPT-4o. HumanVLM, together with the data introduced, will stimulate research in human-centered fields.
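The abstract positions HumanVLM as a caption-capable vision-language model for human-centered images. As a minimal sketch of how such a model might be queried for a human-scene description, assuming a LLaVA-style checkpoint loadable through the Hugging Face transformers interface (the checkpoint path, prompt template, and image file below are hypothetical and not specified by the paper):

```python
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical checkpoint path; the paper does not fix a release format,
# so a LLaVA-style interface is assumed here purely for illustration.
MODEL_PATH = "path/to/humanvlm-checkpoint"

processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, device_map="auto"
)

# Ask for a caption covering face, body, and background, mirroring the
# captioning focus described in the abstract.
image = Image.open("person_in_scene.jpg")
prompt = "USER: <image>\nDescribe the person's face, body, and the background. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```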
Related papers
- Human-Centric Foundation Models: Perception, Generation and Agentic Modeling [79.97999901785772]
Human-centric Foundation Models unify diverse human-centric tasks into a single framework.
We present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups.
This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent digital human and embodiments modeling.
arXiv Detail & Related papers (2025-02-12T16:38:40Z) - HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding [16.93348898548816]
HumanOmni is the industry's first human-centric omni-multimodal large language model.
We constructed a dataset containing over 2.4 million human-centric video clips with detailed captions and more than 14 million instructions.
Our experiments validate HumanOmni's advanced capabilities in handling human-centric scenes across a variety of tasks.
arXiv Detail & Related papers (2025-01-25T07:26:37Z) - HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs.
HumanVBench comprises 17 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects.
arXiv Detail & Related papers (2024-12-23T13:45:56Z) - Human Multi-View Synthesis from a Single-View Model:Transferred Body and Face Representations [7.448124739584319]
We propose an innovative framework that leverages transferred body and facial representations for multi-view human synthesis.
Specifically, we use a single-view model pretrained on a large-scale human dataset to develop a multi-view body representation.
Our approach outperforms the current state-of-the-art methods, achieving superior performance in multi-view human synthesis.
arXiv Detail & Related papers (2024-12-04T04:02:17Z) - High-Dimension Human Value Representation in Large Language Models [60.33033114185092]
We propose UniVaR, a high-dimensional representation of human value distributions in Large Language Models (LLMs).
We show that UniVaR is a powerful tool to compare the distribution of human values embedded in different LLMs with different language sources.
arXiv Detail & Related papers (2024-04-11T16:39:00Z) - Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance [48.986552871497]
We introduce a novel two-stage framework that employs scene affordance as an intermediate representation.
By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals.
Our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE.
arXiv Detail & Related papers (2024-03-26T18:41:07Z) - MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures [44.172804112944625]
We present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities.
Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions.
arXiv Detail & Related papers (2023-12-05T18:50:12Z) - Hulk: A Universal Knowledge Translator for Human-Centric Tasks [69.8518392427151]
We present Hulk, the first multimodal human-centric generalist model.
It addresses 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning.
Hulk achieves state-of-the-art performance in 11 benchmarks.
arXiv Detail & Related papers (2023-12-04T07:36:04Z) - Human-centric Scene Understanding for 3D Large-scale Scenarios [52.12727427303162]
We present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife.
Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc.
arXiv Detail & Related papers (2023-07-26T08:40:46Z)