Visual Semantic Multimedia Event Model for Complex Event Detection in
Video Streams
- URL: http://arxiv.org/abs/2009.14525v1
- Date: Wed, 30 Sep 2020 09:22:23 GMT
- Title: Visual Semantic Multimedia Event Model for Complex Event Detection in
Video Streams
- Authors: Piyush Yadav, Edward Curry
- Abstract summary: Middleware systems such as complex event processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion.
We present a visual event specification method to enable complex multimedia event processing by creating a structured knowledge representation from low-level media streams.
- Score: 5.53329677986653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimedia data is highly expressive and has traditionally been very
difficult for a machine to interpret. Middleware systems such as complex event
processing (CEP) mine patterns from data streams and send notifications to
users in a timely fashion. Presently, CEP systems have inherent limitations in
processing multimedia streams due to their data complexity and the lack of an
underlying structured data model. In this work, we present a visual event
specification method to enable complex multimedia event processing by creating
a semantic knowledge representation derived from low-level media streams. The
method enables the detection of high-level semantic concepts from the media
streams using an ensemble of pattern detection capabilities. The semantic model
is aligned with the deep learning models of a multimedia CEP engine, giving
end-users the flexibility to build rules using spatiotemporal event calculus.
This enhances the CEP capability to detect patterns from media streams and
bridges the semantic gap between highly expressive, knowledge-centric user
queries and the low-level features of the multimedia data. We have built a
small traffic event ontology prototype to validate the approach and its
performance. The paper's contribution is threefold: i) a knowledge graph
representation for multimedia streams, ii) a hierarchical event network to
detect visual patterns from media streams, and iii) complex pattern rules for
multimedia event reasoning using event calculus.
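As a concrete illustration of the kind of rule an end-user could express, below is a minimal Python sketch, not the authors' implementation: detections from a media stream form a structured representation, and a toy spatiotemporal rule fires a complex event when a car and a person overlap in two consecutive frames. The `Detection` schema, the `overlaps` predicate, and the `CarNearPerson` rule are hypothetical stand-ins for the paper's knowledge-graph and event-calculus machinery.

```python
from dataclasses import dataclass

# Hypothetical structured representation of one video frame: each detection
# becomes a node with a label and a bounding box (x1, y1, x2, y2).
@dataclass
class Detection:
    label: str
    box: tuple  # (x1, y1, x2, y2)

def overlaps(a: Detection, b: Detection) -> bool:
    """Spatial predicate: do two bounding boxes intersect?"""
    ax1, ay1, ax2, ay2 = a.box
    bx1, by1, bx2, by2 = b.box
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def detect_complex_event(stream):
    """Toy spatiotemporal rule: fire a complex event when a 'car' and a
    'person' overlap in two consecutive frames."""
    prev_hit = False
    for t, frame in enumerate(stream):
        cars = [d for d in frame if d.label == "car"]
        people = [d for d in frame if d.label == "person"]
        hit = any(overlaps(c, p) for c in cars for p in people)
        if hit and prev_hit:
            yield {"event": "CarNearPerson", "frame": t}
        prev_hit = hit

# Toy stream: two frames in which a car and a person overlap.
stream = [
    [Detection("car", (0, 0, 10, 10)), Detection("person", (8, 8, 12, 12))],
    [Detection("car", (1, 1, 11, 11)), Detection("person", (9, 9, 13, 13))],
]
print(list(detect_complex_event(stream)))  # [{'event': 'CarNearPerson', 'frame': 1}]
```

The same pattern generalizes: spatial predicates over object nodes combined with temporal conditions over frame sequences yield complex-event rules.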
Related papers
- A New Hybrid Intelligent Approach for Multimodal Detection of Suspected Disinformation on TikTok [0.0]
This study introduces a hybrid framework that combines the computational power of deep learning with the interpretability of fuzzy logic to detect suspected disinformation in TikTok videos.
The methodology comprises two core components: a multimodal feature analyser that extracts and evaluates data from text, audio, and video; and a multimodal disinformation detector based on fuzzy logic.
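A minimal sketch of how such a fuzzy-logic detector could combine per-modality suspicion scores, assuming Mamdani-style rules with min for AND; the membership functions and the two rules below are invented for illustration and are not the paper's actual rule base.

```python
def high(x):   # membership in the fuzzy set "high suspicion"
    return max(0.0, min(1.0, (x - 0.4) / 0.4))

def low(x):    # membership in the fuzzy set "low suspicion"
    return 1.0 - high(x)

def fuse(text_score, audio_score, video_score):
    """Combine rule strengths from per-modality scores in [0, 1]."""
    # Rule 1: text high AND video high -> suspected disinformation
    r1 = min(high(text_score), high(video_score))
    # Rule 2: all three modalities low -> not suspected
    r2 = min(low(text_score), low(audio_score), low(video_score))
    # Defuzzify with a simple ratio of the two rule strengths.
    return r1 / (r1 + r2) if (r1 + r2) > 0 else 0.5

print(fuse(0.9, 0.3, 0.8))  # 1.0 -> high suspicion driven by text + video
```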
arXiv Detail & Related papers (2025-02-09T12:37:48Z)
- Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning [56.873534081386]
A new benchmark, HIREST, is presented, covering video retrieval, moment retrieval, moment segmentation, and step-captioning.
We propose a query-centric audio-visual cognition network to construct a reliable multi-modal representation for the latter three tasks.
The network identifies user-preferred content and thus attains a query-centric audio-visual representation shared across the three tasks.
arXiv Detail & Related papers (2024-12-18T06:43:06Z)
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, which consists of 14.5 hours of densely annotated current-event videos and 1,168 text documents containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate the promise of event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose MultiMD, a Multimedia Misinformation Detection framework that detects misinformation in video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
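A hedged sketch of the cross-modal entity-consistency idea: compare entities mentioned in the text with entities recognized in the video and use their agreement as a misinformation feature. The entity sets below stand in for the outputs of real NER and visual-recognition models.

```python
def entity_consistency(text_entities, video_entities):
    """Jaccard overlap between textual and visual entity sets."""
    text_entities, video_entities = set(text_entities), set(video_entities)
    union = text_entities | video_entities
    return len(text_entities & video_entities) / len(union) if union else 1.0

text_entities = {"flood", "new york", "rescue boat"}   # e.g. from an NER model
video_entities = {"beach", "surfer"}                   # e.g. from a video tagger
score = entity_consistency(text_entities, video_entities)
print(score)  # 0.0 -> text and video disagree, a misinformation signal
```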
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling [4.160176518973659]
We introduce a unified template filling model that connects the textual and visual modalities via textual prompts.
Our system surpasses the current SOTA on textual event argument extraction (EAE) by +7% F1 and generally outperforms the second-best systems on multimedia EAE.
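A toy sketch of the template-filling idea, assuming an invented event template and candidate arguments: an event type maps to a textual prompt whose slots are filled with the highest-scoring textual or visual candidates.

```python
# Invented template and candidates for illustration only.
TEMPLATES = {
    "Attack": "<attacker> attacked <target> using <instrument>",
}

def fill_template(event_type, candidates):
    """Greedily fill each slot with the highest-scoring candidate."""
    prompt = TEMPLATES[event_type]
    for slot, options in candidates.items():
        best = max(options, key=lambda c: c[1])[0]  # (argument, score) pairs
        prompt = prompt.replace(f"<{slot}>", best)
    return prompt

candidates = {
    "attacker": [("the soldiers", 0.9), ("a crowd", 0.2)],   # textual spans
    "target": [("the convoy", 0.8)],
    "instrument": [("rifles", 0.7)],                         # e.g. visual label
}
print(fill_template("Attack", candidates))
# -> "the soldiers attacked the convoy using rifles"
```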
arXiv Detail & Related papers (2024-06-18T09:14:17Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
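A minimal sketch of early fusion in a single stream, with invented shapes: per-modality skeleton features (e.g. joint, motion, bone) are projected into a shared space, concatenated, and passed through one backbone rather than one stream per modality.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 64                        # frames, feature dim per modality

joint  = rng.standard_normal((T, D))
motion = rng.standard_normal((T, D))
bone   = rng.standard_normal((T, D))

# Shared projection: one set of weights, i.e. a single stream.
W = rng.standard_normal((D, D)) / np.sqrt(D)

# Early fusion: concatenate modality tokens, then a single encoder pass.
tokens = np.concatenate([joint @ W, motion @ W, bone @ W], axis=0)  # (3T, D)
encoded = np.tanh(tokens)              # stand-in for a transformer backbone
representation = encoded.mean(axis=0)  # one unified multi-modal embedding
print(representation.shape)            # (64,)
```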
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distances and govern multi-modal interactions in a self-supervised way.
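A hedged sketch of support-set construction, assuming cosine similarity and a fixed pool size: each sample's representation is enhanced with the pooled features of its most similar neighbours in the batch, approximating a shared semantic subspace.

```python
import numpy as np

def build_support_set(features, k=2):
    """features: (N, D) array; returns (N, D) support-enhanced features."""
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T                       # cosine similarity, (N, N)
    np.fill_diagonal(sim, -np.inf)            # exclude the sample itself
    enhanced = np.empty_like(features)
    for i in range(len(features)):
        support = np.argsort(sim[i])[-k:]     # indices of k nearest samples
        enhanced[i] = 0.5 * features[i] + 0.5 * features[support].mean(axis=0)
    return enhanced

batch = np.random.default_rng(1).standard_normal((4, 8))
print(build_support_set(batch).shape)  # (4, 8)
```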
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
- Reliable Shot Identification for Complex Event Detection via Visual-Semantic Embedding [72.9370352430965]
We propose a visual-semantic guided loss method for event detection in videos.
Motivated by curriculum learning, we introduce a negative elastic-net regularization term to start training the classifier with instances of high reliability.
An alternating optimization algorithm is developed to solve the proposed challenging non-convex, non-smooth problem.
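A rough sketch of the curriculum idea with an elastic-net-style penalty, using invented reliability scores: training starts with the most reliable shots and admits harder ones as epochs progress; the penalty mixes L1 and L2 terms on the selection weights.

```python
import numpy as np

def curriculum_weights(reliability, epoch, total_epochs):
    """Admit the top fraction of shots, growing linearly with the epoch."""
    keep = max(1, int(len(reliability) * (epoch + 1) / total_epochs))
    order = np.argsort(reliability)[::-1]      # most reliable first
    w = np.zeros_like(reliability)
    w[order[:keep]] = 1.0
    return w

def elastic_net(w, alpha=0.1, l1_ratio=0.5):
    """Elastic-net-style penalty on the selection weights."""
    return alpha * (l1_ratio * np.abs(w).sum() + (1 - l1_ratio) * (w ** 2).sum())

reliability = np.array([0.9, 0.2, 0.7, 0.4])   # toy per-shot reliability
for epoch in range(4):
    w = curriculum_weights(reliability, epoch, 4)
    print(epoch, w, round(elastic_net(w), 2))
```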
arXiv Detail & Related papers (2021-10-12T11:46:56Z)
- METEOR: Learning Memory and Time Efficient Representations from Multi-modal Data Streams [19.22829945777267]
We present METEOR, a novel MEmory and Time Efficient Online Representation learning technique.
We show that METEOR preserves the quality of the representations while reducing memory usage by around 80% compared to conventional memory-intensive embeddings.
arXiv Detail & Related papers (2020-07-23T08:18:02Z)
- Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
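A minimal sketch of cross-attention between modalities, with invented dimensions: text tokens act as queries over image-region features, so regions that do not support the text receive low attention weight and are effectively filtered.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: (Tq, D), (Tk, D), (Tk, D) -> (Tq, D)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (Tq, Tk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over regions
    return weights @ values

rng = np.random.default_rng(2)
text_tokens = rng.standard_normal((5, 32))   # e.g. tweet token embeddings
img_regions = rng.standard_normal((9, 32))   # e.g. image-region features
fused = cross_attention(text_tokens, img_regions, img_regions)
print(fused.shape)  # (5, 32): text attended over informative image regions
```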
arXiv Detail & Related papers (2020-04-10T06:31:30Z)