Visual Semantic Multimedia Event Model for Complex Event Detection in
Video Streams
- URL: http://arxiv.org/abs/2009.14525v1
- Date: Wed, 30 Sep 2020 09:22:23 GMT
- Title: Visual Semantic Multimedia Event Model for Complex Event Detection in
Video Streams
- Authors: Piyush Yadav, Edward Curry
- Abstract summary: Middleware systems such as complex event processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion.
We present a visual event specification method to enable complex multimedia event processing by creating a semantic knowledge representation derived from low-level media streams.
- Score: 5.53329677986653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimedia data is highly expressive and has traditionally been very
difficult for a machine to interpret. Middleware systems such as complex event
processing (CEP) mine patterns from data streams and send notifications to
users in a timely fashion. Presently, CEP systems have inherent limitations in
processing multimedia streams due to their data complexity and the lack of an
underlying structured data model. In this work, we present a visual event
specification method to enable complex multimedia event processing by creating
a semantic knowledge representation derived from low-level media streams. The
method enables the detection of high-level semantic concepts from the media
streams using an ensemble of pattern detection capabilities. The semantic model
is aligned with the deep learning models of a multimedia CEP engine, giving
end-users the flexibility to build rules using spatiotemporal event calculus.
This enhances the CEP capability to detect patterns from media streams and
bridges the semantic gap between highly expressive, knowledge-centric user
queries and the low-level features of the multimedia data. We have built a
small traffic event ontology prototype to validate the approach and its
performance. The paper's contribution is threefold: i) a knowledge graph
representation for multimedia streams, ii) a hierarchical event network to
detect visual patterns from media streams, and iii) complex pattern rules for
multimedia event reasoning using event calculus.
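To make the pattern-rule idea concrete, the following is a minimal, self-contained sketch of a spatiotemporal rule over frame-level object detections. The Detection structure, the concept names, and the "car blocking a crosswalk" rule are illustrative assumptions, not the paper's actual event-calculus syntax or CEP engine API.

```python
# Hypothetical sketch of a spatiotemporal pattern over frame-level detections.
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # x1, y1, x2, y2

@dataclass
class Detection:
    frame: int   # frame index, used as a proxy for time
    label: str   # semantic concept emitted by the vision model, e.g. "car"
    box: Box

def overlaps(a: Box, b: Box) -> bool:
    """Simple spatial predicate: True if two axis-aligned boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def detect_blocking(stream: Iterable[List[Detection]],
                    min_frames: int = 5) -> Iterator[int]:
    """Emit a complex event when a 'car' overlaps a 'crosswalk' in at least
    `min_frames` consecutive frames (a HOLDS-style temporal condition)."""
    run = 0
    for frame_dets in stream:
        cars = [d for d in frame_dets if d.label == "car"]
        zones = [d for d in frame_dets if d.label == "crosswalk"]
        hit = any(overlaps(c.box, z.box) for c in cars for z in zones)
        run = run + 1 if hit else 0
        if run == min_frames:
            yield frame_dets[0].frame  # notify once per sustained overlap
```

In the paper's approach, predicates of this kind would be evaluated over the knowledge-graph representation derived from the media stream rather than over raw bounding boxes.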
Related papers
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
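As a purely illustrative sketch of the cross-modal entity-consistency signal named above (not the MultiMD framework itself), the snippet below compares entities extracted from the text with entities recognized in the video and treats low overlap as a possible mismatch; the entity sets and the threshold are made up for illustration.

```python
def entity_consistency(text_entities: set, visual_entities: set) -> float:
    """Jaccard overlap between the two entity sets (1.0 = fully consistent)."""
    if not text_entities and not visual_entities:
        return 1.0
    return len(text_entities & visual_entities) / len(text_entities | visual_entities)

caption_entities = {"paris", "eiffel tower", "protest"}   # from the text modality
frame_entities = {"eiffel tower", "tourists"}             # from the visual modality
score = entity_consistency(caption_entities, frame_entities)
suspicious = score < 0.3  # arbitrary illustrative threshold
```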
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling [4.160176518973659]
We introduce a unified template filling model that connects the textual and visual modalities via textual prompts.
Our system surpasses the current SOTA on textual EAE by +7% F1, and performs generally better than the second-best systems for multimedia EAE.
arXiv Detail & Related papers (2024-06-18T09:14:17Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
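As a rough, hedged illustration of the two steps summarized above (not the authors' implementation), the sketch below shares one embedding table between text tokens and quantized visual codes and runs a causally masked Transformer over the joint sequence; the sizes, the token-offset scheme, and the dense stand-in for the sparse decoder are all assumptions.

```python
import torch
import torch.nn as nn

class UnifiedICLModel(nn.Module):
    def __init__(self, text_vocab=8192, visual_codes=1024, d_model=512):
        super().__init__()
        # One shared table: text ids in [0, text_vocab), quantized visual codes
        # offset into [text_vocab, text_vocab + visual_codes).
        self.embed = nn.Embedding(text_vocab + visual_codes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Dense stand-in for the paper's sparse decoder-only architecture.
        self.decoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, text_vocab + visual_codes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        h = self.decoder(self.embed(token_ids), mask=causal)
        return self.head(h)  # next-token logits over both modalities

# Text and (already quantized) visual prompt tokens interleaved in one sequence.
tokens = torch.randint(0, 8192 + 1024, (2, 16))
logits = UnifiedICLModel()(tokens)  # shape: (2, 16, 9216)
```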
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
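A minimal sketch of the single-stream early-fusion idea, assuming skeleton-style input modalities and placeholder dimensions; it only illustrates projecting each modality into a shared space and encoding everything with one backbone, and is not the UmURL code.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, dim=256, in_dim=150):
        super().__init__()
        # Modality-specific projections into one shared space (early fusion).
        self.proj = nn.ModuleDict({
            "joint":  nn.Linear(in_dim, dim),
            "motion": nn.Linear(in_dim, dim),
            "bone":   nn.Linear(in_dim, dim),
        })
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # single stream

    def forward(self, inputs: dict) -> torch.Tensor:
        # Concatenate projected modality tokens along the sequence axis so one
        # backbone jointly encodes all modalities.
        tokens = torch.cat([self.proj[m](x) for m, x in inputs.items()], dim=1)
        return self.backbone(tokens).mean(dim=1)  # one unified representation

x = {m: torch.randn(2, 20, 150) for m in ("joint", "motion", "bone")}
rep = EarlyFusionEncoder()(x)  # shape: (2, 256)
```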
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
- Reliable Shot Identification for Complex Event Detection via Visual-Semantic Embedding [72.9370352430965]
We propose a visual-semantic guided loss method for event detection in videos.
Motivated by curriculum learning, we introduce a negative elastic regularization term to start training the classifier with instances of high reliability.
An alternating optimization algorithm is developed to solve the proposed challenging non-convex regularization problem.
arXiv Detail & Related papers (2021-10-12T11:46:56Z)
- METEOR: Learning Memory and Time Efficient Representations from Multi-modal Data Streams [19.22829945777267]
We present METEOR, a novel MEmory and Time Efficient Online Representation learning technique.
We show that METEOR preserves the quality of the representations while reducing memory usage by around 80% compared to the conventional memory-intensive embeddings.
arXiv Detail & Related papers (2020-07-23T08:18:02Z)
- VidCEP: Complex Event Processing Framework to Detect Spatiotemporal Patterns in Video Streams [5.53329677986653]
Middleware systems such as Complex Event Processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion.
Current CEP systems have inherent limitations in querying video streams due to their unstructured data model and the lack of an expressive query language.
We propose VidCEP, an in-memory, near real-time complex event matching framework for video streams.
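As an illustration of windowed event matching over a video-derived stream (the event names and window semantics are hypothetical and not VidCEP's query language or matcher), the sketch below reports a match whenever a "person" event is followed by a "car" event within a fixed time window.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, detected concept)

def seq_within(stream: Iterable[Event], first: str, second: str,
               window: float = 10.0) -> Iterator[Tuple[Event, Event]]:
    pending = deque()  # recent `first` events still inside the time window
    for ts, label in stream:
        while pending and ts - pending[0][0] > window:
            pending.popleft()          # expire events that fell out of the window
        if label == first:
            pending.append((ts, label))
        elif label == second and pending:
            yield pending.popleft(), (ts, label)  # earliest match in the window

events = [(0.5, "person"), (3.0, "dog"), (7.2, "car"), (25.0, "car")]
print(list(seq_within(events, "person", "car")))  # [((0.5, 'person'), (7.2, 'car'))]
```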
arXiv Detail & Related papers (2020-07-15T16:43:37Z)
- Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
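A hedged sketch of the cross-attention idea described above, in which text queries attend over image-region features so that uninformative regions receive little weight; the dimensions and the residual fusion are illustrative assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # Queries come from text tokens; keys/values from image regions, so the
        # attention weights decide how much each region contributes.
        fused, _ = self.attn(query=text, key=image, value=image)
        return self.norm(text + fused)  # residual fusion of the filtered signal

text = torch.randn(2, 12, 256)   # 12 text tokens
image = torch.randn(2, 49, 256)  # 7x7 grid of region features
out = CrossModalAttention()(text, image)  # shape: (2, 12, 256)
```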
arXiv Detail & Related papers (2020-04-10T06:31:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.