Large Language Models for Video Surveillance Applications
- URL: http://arxiv.org/abs/2501.02850v1
- Date: Mon, 06 Jan 2025 08:57:44 GMT
- Title: Large Language Models for Video Surveillance Applications
- Authors: Ulindu De Silva, Leon Fernando, Billy Lau Pik Lik, Zann Koh, Sam Conrad Joyce, Belinda Yuen, Chau Yuen,
- Abstract summary: This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models.
Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets.
- Score: 11.297664744056735
- License:
- Abstract: The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. To address this, robust video analysis tools are essential. This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach utilizes Vision Language Models to extract relevant information, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage, which can then be stored for an indefinite time in a very small storage space compared to videos, allowing users to quickly navigate and verify significant events without exhaustive manual review. Qualitative evaluations result in 80% and 70% accuracy in temporal and spatial quality and consistency of the pipeline respectively.
Related papers
- Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence.
Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z) - Video Summarisation with Incident and Context Information using Generative AI [0.5522795407400668]
This paper proposes a novel approach leveraging Generative Artificial Intelligence (GenAI) to facilitate streamlined video analysis.
Our tool aims to deliver tailored textual summaries of user-defined queries, offering a focused insight amidst extensive video datasets.
arXiv Detail & Related papers (2025-01-08T18:35:48Z) - SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z) - Enhancing Video Summarization with Context Awareness [9.861215740353247]
Video summarization automatically generate concise summaries by selecting techniques, shots, or segments that capture the video's essence.
Despite the importance of video summarization, there is a lack of diverse and representative datasets.
We propose an unsupervised approach that leverages video data structure and information for generating informative summaries.
arXiv Detail & Related papers (2024-04-06T09:08:34Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - Leveraging Generative Language Models for Weakly Supervised Sentence
Component Analysis in Video-Language Joint Learning [10.486585276898472]
A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks.
We postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models.
We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks.
arXiv Detail & Related papers (2023-12-10T02:03:51Z) - Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating
Video-based Large Language Models [81.84810348214113]
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.
To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial.
This paper proposes textitVideo-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs.
arXiv Detail & Related papers (2023-11-27T18:59:58Z) - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and
Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.