VideoScoop: A Non-Traditional Domain-Independent Framework For Video Analysis
- URL: http://arxiv.org/abs/2512.01769v1
- Date: Mon, 01 Dec 2025 15:09:46 GMT
- Title: VideoScoop: A Non-Traditional Domain-Independent Framework For Video Analysis
- Authors: Hafsa Billah,
- Abstract summary: Video Situation Analysis (VSA) is done manually with a human in the loop, which is error-prone and labor-intensive.<n>This report proposes a general-purpose VSA framework that overcomes the above limitations.<n>Video contents are extracted once using state-of-the-art Video Content Extraction technologies.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatically understanding video contents is important for several applications in Civic Monitoring (CM), general Surveillance (SL), Assisted Living (AL), etc. Decades of Image and Video Analysis (IVA) research have advanced tasks such as content extraction (e.g., object recognition and tracking). Identifying meaningful activities or situations (e.g., two objects coming closer) remains difficult and cannot be achieved by content extraction alone. Currently, Video Situation Analysis (VSA) is done manually with a human in the loop, which is error-prone and labor-intensive, or through custom algorithms designed for specific video types or situations. These algorithms are not general-purpose and require a new algorithm/software for each new situation or video from a new domain. This report proposes a general-purpose VSA framework that overcomes the above limitations. Video contents are extracted once using state-of-the-art Video Content Extraction technologies. They are represented using two alternative models -- the extended relational model (R++) and graph models. When represented using R++, the extracted contents can be used as data streams, enabling Continuous Query Processing via the proposed Continuous Query Language for Video Analysis. The graph models complement this by enabling the detection of situations that are difficult or impossible to detect using the relational model alone. Existing graph algorithms and newly developed algorithms support a wide variety of situation detection. To support domain independence, primitive situation variants across domains are identified and expressed as parameterized templates. Extensive experiments were conducted across several interesting situations from three domains -- AL, CM, and SL-- to evaluate the accuracy, efficiency, and robustness of the proposed approach using a dataset of videos of varying lengths from these domains.
Related papers
- Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning [32.71093573332936]
VideoDR is the first video deep research benchmark for studying video agents in open-web settings.<n>VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence.
arXiv Detail & Related papers (2026-01-11T15:07:37Z) - VUDG: A Dataset for Video Understanding Domain Generalization [29.27464392754555]
Video Understanding Domain Generalization (VUDG) is an annotated dataset designed specifically for evaluating the DG performance in video understanding.<n>VUDG contains videos from 11 distinct domains that cover three types of domain shifts, and maintains semantic similarity across different domains to ensure fair and meaningful evaluation.
arXiv Detail & Related papers (2025-05-30T08:39:36Z) - AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM [2.5988879420706095]
Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision.<n>Existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments.<n>This study proposes customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model.
arXiv Detail & Related papers (2025-03-06T14:52:34Z) - Real-Time Anomaly Detection in Video Streams [0.0]
This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory.<n>The objective is to develop an artificial intelligence system that can detect real-time dangers in a video stream.
arXiv Detail & Related papers (2024-11-29T14:24:33Z) - Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z) - Generative Video Diffusion for Unseen Novel Semantic Video Moment Retrieval [54.22321767540878]
Video moment retrieval (VMR) aims to locate the most likely video moment corresponding to a text query in untrimmed videos.<n>Training of existing methods is limited by the lack of diverse and generalisable VMR datasets.<n>We propose a Fine-grained Video Editing framework, termed FVE, that explores generative video diffusion.
arXiv Detail & Related papers (2024-01-24T09:45:40Z) - Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains [4.9347081318119015]
We introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering.<n>In tandem, the two tasks quantify a model's ability to: (1) generalize to novel domains; (2) utilize long temporal context and multimodal (e.g. visual and speech) information.<n>We discover a promising adaptation via summarization technique that leads to significant performance improvement without model fine-tuning.
arXiv Detail & Related papers (2023-11-30T18:19:23Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval ( VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Video Exploration via Video-Specific Autoencoders [60.256055890647595]
We present video-specific autoencoders that enables human-controllable video exploration.
We observe that a simple autoencoder trained on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks.
arXiv Detail & Related papers (2021-03-31T17:56:13Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and which pass more relevant information to gates.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.