Human-centric Behavior Description in Videos: New Benchmark and Model
- URL: http://arxiv.org/abs/2310.02894v1
- Date: Wed, 4 Oct 2023 15:31:02 GMT
- Title: Human-centric Behavior Description in Videos: New Benchmark and Model
- Authors: Lingru Zhou, Yiqi Gao, Manqing Zhang, Peng Wu, Peng Wang, and Yanning
Zhang
- Abstract summary: We construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals.
Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos.
- Score: 37.96539992056626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the domain of video surveillance, describing the behavior of each
individual within the video is becoming increasingly essential, especially in
complex scenarios with multiple individuals present. This is because describing
each individual's behavior provides more detailed situational analysis,
enabling accurate assessment and response to potential risks, ensuring the
safety and harmony of public places. Currently, video-level captioning datasets
cannot provide fine-grained descriptions of each individual's specific
behavior: descriptions at the video level fail to offer an in-depth
interpretation of individual behaviors, making it challenging to accurately
determine the specific identity of each individual. To address this
challenge, we construct a human-centric video surveillance captioning dataset,
which provides detailed descriptions of the dynamic behaviors of 7,820
individuals. Specifically, we have labeled several aspects of each person, such
as location, clothing, and interactions with other elements in the scene, and
these people are distributed across 1,012 videos. Based on this dataset, we can
link individuals to their respective behaviors, allowing for further analysis
of each person's behavior in surveillance videos. Besides the dataset, we
propose a novel video captioning approach that can describe individual behavior
in detail on a person-level basis, achieving state-of-the-art results. To
facilitate further research in this field, we intend to release our dataset and
code.
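To make the described dataset structure concrete, the sketch below shows one plausible way a per-person annotation record could be organized and linked to its behavior description. It is a minimal sketch under stated assumptions: the field names (person_id, bbox-style location, clothing, interactions, caption) and the behaviors_by_video helper are illustrative inventions, not the released annotation format.

```python
from dataclasses import dataclass, field

@dataclass
class PersonAnnotation:
    """Hypothetical per-person record; field names are illustrative
    assumptions, not the dataset's released schema."""
    person_id: str          # unique ID within a video
    video_id: str           # one of the 1,012 surveillance videos
    location: str           # e.g. "near the entrance, left of frame"
    clothing: str           # e.g. "red jacket, dark trousers"
    interactions: list = field(default_factory=list)  # scene elements the person interacts with
    caption: str = ""       # free-text description of the person's behavior

def behaviors_by_video(annotations):
    """Link each individual to their own behavior description,
    grouped by the video they appear in."""
    index = {}
    for ann in annotations:
        index.setdefault(ann.video_id, []).append((ann.person_id, ann.caption))
    return index

# Toy usage with made-up records.
people = [
    PersonAnnotation("p1", "v0001", "center of the hall", "blue coat",
                     ["luggage cart"], "pushes a cart toward the exit"),
    PersonAnnotation("p2", "v0001", "right edge of frame", "white shirt",
                     [], "stands still while checking a phone"),
]
print(behaviors_by_video(people)["v0001"])
# [('p1', 'pushes a cart toward the exit'), ('p2', 'stands still while checking a phone')]
```

An index of this kind is what would allow the per-person analysis the abstract describes: retrieving every behavior description for a given video, or following one person's caption independently of the others in the scene.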
Related papers
- Multi-identity Human Image Animation with Structural Video Diffusion [64.20452431561436]
We present Structural Video Diffusion, a novel framework for generating realistic multi-human videos.
Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals.
We expand an existing human video dataset with 25K new videos featuring diverse multi-human and object-interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z)
- PV-VTT: A Privacy-Centric Dataset for Mission-Specific Anomaly Detection and Natural Language Interpretation [5.0923114224599555]
We present PV-VTT (Privacy Violation Video To Text), a unique multimodal dataset aimed at identifying privacy violations.
PV-VTT provides detailed annotations for both video and text.
This privacy-focused approach allows researchers to use the dataset while protecting participant confidentiality.
arXiv Detail & Related papers (2024-10-30T01:02:20Z)
- HabitAction: A Video Dataset for Human Habitual Behavior Recognition [3.7478789114676108]
Human habitual behaviors (HHBs) hold significant importance for analyzing a person's personality, habits, and psychological changes.
In this work, we build a novel video dataset to demonstrate various HHBs.
The dataset contains 30 categories of habitual behaviors, comprising more than 300,000 frames and 6,899 action instances.
arXiv Detail & Related papers (2024-08-24T04:40:31Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain [23.598727613908853]
We present MECCANO, a dataset of egocentric videos for studying human behavior understanding in industrial-like settings.
The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset.
The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first person view.
arXiv Detail & Related papers (2022-09-19T00:52:42Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Improved Actor Relation Graph based Group Activity Recognition [0.0]
Detailed descriptions of human actions and group activities are essential information that can be used in real-time CCTV video surveillance, health care, sports video analysis, etc.
This study proposes a video understanding method focused mainly on group activity recognition, learning pair-wise actor appearance similarity and actor positions (a rough sketch of this pairwise-similarity idea follows this list).
arXiv Detail & Related papers (2020-10-24T19:46:49Z)
- Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment [50.15466026089435]
We present Vyaktitv, a novel peer-to-peer Hindi conversation dataset.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features for all participants, such as income and cultural orientation, amongst several others.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
- Human in Events: A Large-Scale Benchmark for Human-centric Video Analysis in Complex Events [106.19047816743988]
We present a new large-scale dataset with comprehensive annotations, named Human-in-Events or HiEve.
It contains a record number of poses (>1M), the largest number of action instances (>56k) under complex events, and one of the largest numbers of long-duration trajectories.
Based on its diverse annotations, we present two simple baselines for action recognition and pose estimation.
arXiv Detail & Related papers (2020-05-09T18:24:52Z)
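As a rough illustration of the pairwise-similarity idea referenced in the group activity recognition entry above, the sketch below combines cosine similarity of per-actor appearance embeddings with a Gaussian weight on actor positions. This is a generic sketch under stated assumptions, not that paper's actual model: the non-negative clipping, the distance kernel, and the sigma parameter are all assumptions made for illustration.

```python
import numpy as np

def actor_relation_graph(features, positions, sigma=1.0):
    """Build an N x N affinity matrix over actors from appearance
    similarity and spatial proximity. A generic sketch of the
    pairwise idea, not the cited paper's exact formulation."""
    # Cosine similarity between appearance embeddings (N x D),
    # clipped to keep affinities non-negative.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    appearance = np.maximum(f @ f.T, 0.0)
    # Gaussian kernel over squared pairwise distances of actor centers (N x 2).
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    proximity = np.exp(-d2 / (2 * sigma ** 2))
    # Combine and row-normalize so each actor's edge weights sum to 1.
    graph = appearance * proximity
    return graph / graph.sum(axis=1, keepdims=True)

# Toy example: 3 actors with random 8-D embeddings and 2-D positions.
rng = np.random.default_rng(0)
A = actor_relation_graph(rng.normal(size=(3, 8)), rng.uniform(size=(3, 2)))
print(A.round(3))  # each row sums to 1
```

A normalized affinity matrix like this can serve as the adjacency of a relation graph over which per-actor features are aggregated for group-level prediction; the exact construction in the cited work may differ.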