HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
- URL: http://arxiv.org/abs/2501.15111v1
- Date: Sat, 25 Jan 2025 07:26:37 GMT
- Title: HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
- Authors: Jiaxing Zhao, Qize Yang, Yixing Peng, Detao Bai, Shimin Yao, Boyuan Sun, Xiang Chen, Shenghao Fu, Weixuan Chen, Xihan Wei, Liefeng Bo
- Abstract summary: HumanOmni is the industry's first human-centric Omni-multimodal large language model.
We constructed a dataset containing over 2.4 million human-centric video clips with detailed captions and more than 14 million instructions.
Our experiments validate HumanOmni's advanced capabilities in handling human-centric scenes across a variety of tasks.
- Score: 16.93348898548816
- License:
- Abstract: In human-centric scenes, the ability to simultaneously understand visual and auditory information is crucial. While recent omni models can process multiple modalities, they generally lack effectiveness in human-centric scenes due to the absence of large-scale, specialized datasets and non-targeted architectures. In this work, we developed HumanOmni, the industry's first human-centric Omni-multimodal large language model. We constructed a dataset containing over 2.4 million human-centric video clips with detailed captions and more than 14 million instructions, facilitating the understanding of diverse human-centric scenes. HumanOmni includes three specialized branches for understanding different types of scenes. It adaptively fuses features from these branches based on user instructions, significantly enhancing visual understanding in scenes centered around individuals. Moreover, HumanOmni integrates audio features to ensure a comprehensive understanding of environments and individuals. Our experiments validate HumanOmni's advanced capabilities in handling human-centric scenes across a variety of tasks, including emotion recognition, facial expression description, and action understanding. Our model will be open-sourced to facilitate further development and collaboration within both academia and industry.
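The abstract's description of instruction-driven fusion over three specialized visual branches plus audio can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the released HumanOmni code: it assumes three branch feature streams, a pooled instruction embedding that produces softmax gating weights, and audio tokens concatenated after the fused visual tokens; all module names, shapes, and the gating scheme are placeholders.

```python
# Minimal sketch of instruction-conditioned branch fusion, assuming three visual
# branches (e.g. face / body / interaction), a pooled instruction embedding that
# yields softmax gating weights, and audio tokens appended to the fused visual
# tokens. All names, shapes, and the gating scheme are illustrative assumptions,
# not the released HumanOmni implementation.
import torch
import torch.nn as nn

class InstructionGatedFusion(nn.Module):
    def __init__(self, dim: int = 1024, num_branches: int = 3):
        super().__init__()
        # One projection per specialized visual branch.
        self.branch_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_branches)])
        # Maps the instruction embedding to per-branch fusion weights.
        self.gate = nn.Linear(dim, num_branches)
        self.audio_proj = nn.Linear(dim, dim)

    def forward(self, branch_feats, instr_emb, audio_feat):
        # branch_feats: list of [B, T, dim] tensors, one per visual branch
        # instr_emb:    [B, dim] pooled embedding of the user instruction
        # audio_feat:   [B, S, dim] audio features
        weights = torch.softmax(self.gate(instr_emb), dim=-1)            # [B, num_branches]
        stacked = torch.stack(
            [proj(f) for proj, f in zip(self.branch_proj, branch_feats)], dim=1
        )                                                                # [B, num_branches, T, dim]
        visual = (weights[:, :, None, None] * stacked).sum(dim=1)        # [B, T, dim]
        # Concatenate audio tokens so the language model receives both modalities.
        return torch.cat([visual, self.audio_proj(audio_feat)], dim=1)   # [B, T+S, dim]

# Usage sketch: the fused tokens would then be fed to the LLM alongside the text prompt.
# fusion = InstructionGatedFusion()
# tokens = fusion([face_feats, body_feats, scene_feats], instr_emb, audio_feats)
```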
Related papers
- HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs.
HumanVBench comprises 17 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects.
arXiv Detail & Related papers (2024-12-23T13:45:56Z) - HumanVLM: Foundation for Human-Scene Vision-Language Model [3.583459930633303]
Human-scene vision-language tasks are increasingly prevalent in diverse social applications.
This study introduces a domain-specific large vision-language model, the Human-Scene Vision-Language Model (HumanVLM).
In experiments, we evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance.
arXiv Detail & Related papers (2024-11-05T12:14:57Z) - A Unified Framework for Human-centric Point Cloud Video Understanding [23.91555808792291]
Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds.
We propose a unified framework to make full use of the prior knowledge and explore the inherent features in the data itself for generalized human-centric point cloud video understanding.
Our method achieves state-of-the-art performance on various human-related tasks, including action recognition and 3D pose estimation.
arXiv Detail & Related papers (2024-03-29T07:53:06Z) - Hulk: A Universal Knowledge Translator for Human-Centric Tasks [69.8518392427151]
We present Hulk, the first multimodal human-centric generalist model.
It addresses 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning.
Hulk achieves state-of-the-art performance in 11 benchmarks.
arXiv Detail & Related papers (2023-12-04T07:36:04Z) - Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives [194.06650316685798]
Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities.
740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts.
The video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions.
arXiv Detail & Related papers (2023-11-30T05:21:07Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - Human-centric Scene Understanding for 3D Large-scale Scenarios [52.12727427303162]
We present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife.
Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc.
arXiv Detail & Related papers (2023-07-26T08:40:46Z) - Contextually-rich human affect perception using multimodal scene information [36.042369831043686]
We leverage pretrained vision-language models (VLMs) to extract descriptions of foreground context from images.
We propose a multimodal context fusion (MCF) module to combine foreground cues with the visual scene and person-based contextual information for emotion prediction.
We show the effectiveness of our proposed modular design on two datasets associated with natural scenes and TV shows.
arXiv Detail & Related papers (2023-03-13T07:46:41Z) - Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding [87.39245901710079]
We present a new commonsense task, Human-centric Commonsense Grounding.
It tests the models' ability to ground individuals given the context descriptions about what happened before.
We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pretrained models.
arXiv Detail & Related papers (2022-12-14T01:37:16Z)