Contextual Sense Making by Fusing Scene Classification, Detections, and
Events in Full Motion Video
- URL: http://arxiv.org/abs/2001.05979v1
- Date: Thu, 16 Jan 2020 18:26:34 GMT
- Title: Contextual Sense Making by Fusing Scene Classification, Detections, and
Events in Full Motion Video
- Authors: Marc Bosch, Joseph Nassar, Benjamin Ortiz, Brendan Lammers, David
Lindenbaum, John Wahl, Robert Mangum, and Margaret Smith
- Abstract summary: We aim to address the needs of human analysts to consume and exploit data given aerial FMV.
We have divided the problem into three tasks: (1) Context awareness, (2) object cataloging, and (3) event detection.
We have applied our methods on data from different sensors at different resolutions in a variety of geographical areas.
- Score: 0.7348448478819135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the proliferation of imaging sensors, the volume of multi-modal imagery
far exceeds the ability of human analysts to adequately consume and exploit it.
Full motion video (FMV) possesses the extra challenge of containing large
amounts of redundant temporal data. We aim to address the needs of human
analysts to consume and exploit data given aerial FMV. We have investigated and
designed a system capable of detecting events and activities of interest that
deviate from the baseline patterns of observation given FMV feeds. We have
divided the problem into three tasks: (1) Context awareness, (2) object
cataloging, and (3) event detection. The goal of context awareness is to
constrain the problem of visual search and detection in video data. A custom
image classifier categorizes the scene with one or multiple labels to identify
the operating context and environment. This step helps reduce the semantic
search space of downstream tasks and thereby increases their accuracy. The
second step is object cataloging, where an ensemble of object detectors locates
and labels any known objects found in the scene (people, vehicles, boats,
planes, buildings, etc.). Finally, context information and detections are sent
to the event detection engine to monitor for certain behaviors. A series of
analytics monitor the scene by tracking object counts and object interactions.
If these object interactions fall outside the behaviors declared as commonly
observed in the current scene, the system will report, geolocate, and log the event. Events of
interest include identifying a gathering of people as a meeting and/or a crowd,
alerting when there are boats on a beach unloading cargo, increased count of
people entering a building, people getting in and/or out of vehicles of
interest, etc. We have applied our methods on data from different sensors at
different resolutions in a variety of geographical areas.
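The abstract describes a three-stage fusion loop: a scene classifier establishes the operating context, the context restricts the label space searched by an ensemble of object detectors, and the resulting detections feed an analytics engine that flags deviations from the baseline of commonly observed behaviors. Below is a minimal sketch of that control flow; the context taxonomy, the baseline interaction sets, the proximity test, and all class and function names are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical mapping from scene context to the object labels worth searching
# for; the paper does not publish its taxonomy, so these values are assumptions.
CONTEXT_LABELS = {
    "maritime": {"boat", "person", "cargo"},
    "urban": {"person", "vehicle", "building"},
}

# Hypothetical per-context baseline of "commonly observed" interactions;
# anything outside this set is treated as an event of interest.
BASELINE_INTERACTIONS = {
    "maritime": {("boat", "boat")},
    "urban": {("person", "vehicle"), ("vehicle", "vehicle")},
}


@dataclass
class Detection:
    label: str
    box: tuple   # (x, y, w, h) in frame pixels
    geo: tuple   # (lat, lon) after georegistration


def classify_scene(frame):
    """Stage 1 (context awareness): stand-in for the custom scene classifier."""
    return "maritime"  # placeholder prediction


def run_detectors(frame, allowed_labels):
    """Stage 2 (object cataloging): stand-in for the detector ensemble,
    keeping only labels plausible in the current context."""
    raw = [Detection("boat", (10, 10, 40, 20), (36.10, -5.30)),
           Detection("person", (15, 12, 5, 10), (36.10, -5.30))]  # placeholder output
    return [d for d in raw if d.label in allowed_labels]


def interacting(a, b):
    """Crude proximity test standing in for track-level interaction logic."""
    ax, ay, *_ = a.box
    bx, by, *_ = b.box
    return abs(ax - bx) < 20 and abs(ay - by) < 20


def process_frame(frame, event_log):
    context = classify_scene(frame)                              # stage 1
    detections = run_detectors(frame, CONTEXT_LABELS[context])   # stage 2

    # Stage 3 (event detection): monitor object counts and pairwise
    # interactions, reporting anything outside the declared baseline.
    counts = Counter(d.label for d in detections)
    for i, a in enumerate(detections):
        for b in detections[i + 1:]:
            if not interacting(a, b):
                continue
            pair = tuple(sorted((a.label, b.label)))
            if pair not in BASELINE_INTERACTIONS[context]:
                event_log.append({"context": context, "interaction": pair,
                                  "location": a.geo, "counts": dict(counts)})


if __name__ == "__main__":
    log = []
    process_frame(frame=None, event_log=log)   # frame decoding omitted
    print(log)   # a ("boat", "person") interaction flagged in a maritime scene
```

In the full system these analytics would presumably operate on tracked objects across many frames (counts over time, entries and exits, and so on) rather than on single-frame detections, but the per-frame data flow between the three stages is the same.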
Related papers
- Analysis of Unstructured High-Density Crowded Scenes for Crowd Monitoring [55.2480439325792]
We are interested in developing an automated system for detection of organized movements in human crowds.
Computer vision algorithms can extract information from videos of crowded scenes.
We can estimate the number of participants in an organized cohort.
arXiv Detail & Related papers (2024-08-06T22:09:50Z)
- Visual Context-Aware Person Fall Detection [52.49277799455569]
We present a segmentation pipeline to semi-automatically separate individuals and objects in images.
Background objects such as beds, chairs, or wheelchairs can challenge fall detection systems, leading to false positive alarms.
We demonstrate that object-specific contextual transformations during training effectively mitigate this challenge.
arXiv Detail & Related papers (2024-04-11T19:06:36Z)
- Detecting Events in Crowds Through Changes in Geometrical Dimensions of Pedestrians [0.6390468088226495]
We examine three different scenarios of crowd behavior, containing both the cases where an event triggers a change in the behavior of the crowd and two video sequences where the crowd and its motion remain mostly unchanged.
With both the videos and the tracking of the individual pedestrians (performed in a pre-processing phase), we use Geomind to extract significant data about the scene, in particular, the geometrical features, personalities, and emotions of each person.
We then examine the output, seeking a significant change in the way each person acts as a function of time, which could be used as a basis to identify events or to model realistic crowd
arXiv Detail & Related papers (2023-12-11T16:18:56Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
- FGAHOI: Fine-Grained Anchors for Human-Object Interaction Detection [4.534713782093219]
A novel end-to-end transformer-based framework (FGAHOI) is proposed to alleviate the above problems.
FGAHOI comprises three dedicated components namely, multi-scale sampling (MSS), hierarchical spatial-aware merging (HSAM) and task-aware merging mechanism (TAM)
arXiv Detail & Related papers (2023-01-08T03:53:50Z)
- MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain [23.598727613908853]
We present MECCANO, a dataset of egocentric videos to study humans behavior understanding in industrial-like settings.
The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset.
The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first person view.
arXiv Detail & Related papers (2022-09-19T00:52:42Z)
- A Dynamic Data Driven Approach for Explainable Scene Understanding [0.0]
Scene-understanding is an important topic in the area of Computer Vision.
We consider the active explanation-driven understanding and classification of scenes.
Our framework is entitled ACUMEN: Active Classification and Understanding Method by Explanation-driven Networks.
arXiv Detail & Related papers (2022-06-18T02:41:51Z)
- Video Action Detection: Analysing Limitations and Challenges [70.01260415234127]
We analyze existing datasets on video action detection and discuss their limitations.
We perform a biasness study which analyzes a key property differentiating videos from static images: the temporal aspect.
Such extreme experiments show the existence of biases that have crept into existing methods in spite of careful modeling.
arXiv Detail & Related papers (2022-04-17T00:42:14Z)
- Finding a Needle in a Haystack: Tiny Flying Object Detection in 4K Videos using a Joint Detection-and-Tracking Approach [19.59528430884104]
We present a neural network model called the Recurrent Correlational Network, where detection and tracking are jointly performed.
In experiments with datasets containing images of scenes with small flying objects, such as birds and unmanned aerial vehicles, the proposed method yielded consistent improvements.
Our network performs as well as state-of-the-art generic object trackers when evaluated as a tracker on a bird image dataset.
arXiv Detail & Related papers (2021-05-18T03:22:03Z)
- Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes [131.9067467127761]
We focus on improving action recognition by fully utilizing scene information and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of each person in each frame.
We then apply action recognition models to learn the temporal information from video frames, on both the HIE dataset and new data with diverse scenes from the internet.
arXiv Detail & Related papers (2020-10-16T13:08:50Z)
- TAO: A Large-Scale Benchmark for Tracking Any Object [95.87310116010185]
Tracking Any Object dataset consists of 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average.
We ask annotators to label objects that move at any point in the video, and give names to them post factum.
Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets.
arXiv Detail & Related papers (2020-05-20T21:07:28Z)