Garbage in, garbage out: Zero-shot detection of crime using Large
Language Models
- URL: http://arxiv.org/abs/2307.06844v1
- Date: Tue, 4 Jul 2023 01:29:15 GMT
- Title: Garbage in, garbage out: Zero-shot detection of crime using Large
Language Models
- Authors: Anj Simmons, Rajesh Vasa
- Abstract summary: We show that when video is (manually) converted to high quality textual descriptions, large language models are capable of detecting and classifying crimes.
Existing automated video-to-text approaches are unable to generate video descriptions of sufficient quality to support reasoning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes exploiting the common sense knowledge learned by large
language models to perform zero-shot reasoning about crimes given textual
descriptions of surveillance videos. We show that when video is (manually)
converted to high quality textual descriptions, large language models are
capable of detecting and classifying crimes with state-of-the-art performance
using only zero-shot reasoning. However, existing automated video-to-text
approaches are unable to generate video descriptions of sufficient quality to
support reasoning (garbage video descriptions into the large language model,
garbage out).
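To make the proposed pipeline concrete, below is a minimal sketch of the zero-shot reasoning step: a textual description of a surveillance video is placed in a prompt and an LLM is asked to pick a crime category. The chat API, model name, prompt wording, and label set (here, UCF-Crime-style anomaly classes) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code): zero-shot crime classification
# from a textual description of a surveillance video, using an LLM.
from openai import OpenAI

CRIME_CLASSES = [  # assumption: a UCF-Crime-style label set
    "Abuse", "Arrest", "Arson", "Assault", "Burglary", "Explosion",
    "Fighting", "RoadAccidents", "Robbery", "Shooting", "Shoplifting",
    "Stealing", "Vandalism", "Normal",
]

def classify_description(description: str) -> str:
    """Ask the LLM to pick the single best-matching class for the scene."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "The following is a textual description of a surveillance video:\n\n"
        f"{description}\n\n"
        "Which one of these categories best describes the events? "
        f"{', '.join(CRIME_CLASSES)}. Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for classification
    )
    return response.choices[0].message.content.strip()

print(classify_description(
    "A person smashes a shop window at night, grabs items, and runs away."
))
```

Because the reasoning is zero-shot, no crime-specific training data is needed; classification quality hinges entirely on the fidelity of the input description, which is the paper's central point.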
Related papers
- Distilling Vision-Language Models on Millions of Videos [62.92789440875999]
We fine-tune a video-language model from a strong image-language baseline with synthesized instructional data.
The resulting video-instruction-tuned (VIIT) model is then used to auto-label millions of videos to generate high-quality captions.
As a side product, we generate the largest video caption dataset to date.
arXiv Detail & Related papers (2024-01-11T18:59:53Z)
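The auto-labeling step can be sketched as follows, with an off-the-shelf video captioning model (GIT fine-tuned on VATEX, via Hugging Face transformers) standing in for the paper's instruction-tuned model; the model choice, frame-sampling scheme, and output format are assumptions.

```python
# Sketch of the auto-labeling (distillation) step, not the paper's code:
# caption unlabeled clips with a video-language model and collect the
# (video, caption) pairs as a pseudo-labeled caption dataset.
import json
from pathlib import Path

import cv2
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-vatex")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vatex")

def sample_frames(path: str, num_frames: int = 6):
    """Uniformly sample RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in torch.linspace(0, total - 1, num_frames).long().tolist():
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def caption_video(path: str) -> str:
    pixel_values = processor(images=sample_frames(path),
                             return_tensors="pt").pixel_values
    # unsqueeze(0): one video of shape (num_frames, C, H, W) -> batch of one
    ids = model.generate(pixel_values=pixel_values.unsqueeze(0), max_length=50)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

# Pseudo-label a directory of unlabeled clips into a caption dataset.
dataset = [{"video": v.name, "caption": caption_video(str(v))}
           for v in Path("unlabeled_videos").glob("*.mp4")]
Path("captions.json").write_text(json.dumps(dataset, indent=2))
```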
- Generating Enhanced Negatives for Training Language-Based Object Detectors [86.1914216335631]
We propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data.
Specifically, we use large language models to generate negative text descriptions and text-to-image diffusion models to generate the corresponding negative images.
Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.
arXiv Detail & Related papers (2023-12-29T23:04:00Z)
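A sketch of that two-step negative generation, assuming GPT-4 via the OpenAI API as the LLM and Stable Diffusion via diffusers as the text-to-image model; the prompt and the "one wrong detail" heuristic are illustrative, not the paper's exact procedure.

```python
# Illustrative sketch, not the paper's pipeline: an LLM proposes "hard
# negative" descriptions for a positive caption, and a text-to-image
# diffusion model renders images for them.
import torch
from openai import OpenAI
from diffusers import StableDiffusionPipeline

def negative_captions(positive: str, n: int = 3) -> list[str]:
    """Ask the LLM for near-miss descriptions, each wrong in one detail."""
    client = OpenAI()
    prompt = (f"Here is an image description: '{positive}'. "
              f"Write {n} similar descriptions that are each wrong in one "
              "detail (object, attribute, or relation), one per line.")
    text = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for i, neg in enumerate(negative_captions("a red car parked next to a dog")):
    pipe(neg).images[0].save(f"negative_{i}.png")  # negative image for training
```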
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
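The verbalize-then-reason recipe reduces to two steps once shot-level captions exist (produced by any captioning model, e.g. the sketches above); the story format, chat API, and persuasion-strategy question below are assumptions.

```python
# Minimal sketch of verbalize-then-reason: stitch shot captions into a
# story, then run the video-understanding task on the story, not the pixels.
from openai import OpenAI

shot_captions = [  # hypothetical upstream captioning output
    "A man in a suit holds up a smartphone.",
    "A close-up shows the phone's battery indicator at 100%.",
    "Text on screen reads 'two-day battery life'.",
]

# 1) Verbalize: compose the shot-level captions into a single story.
story = " ".join(shot_captions)

# 2) Reason over the generated story zero-shot.
client = OpenAI()
answer = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content":
        f"Story of a video: {story}\n\n"
        "What persuasion strategy does this advertisement use? "
        "Answer briefly."}],
).choices[0].message.content
print(answer)
```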
- Models See Hallucinations: Evaluating the Factuality in Video Captioning [57.85548187177109]
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets.
We find that 57.0% of the model-generated sentences contain factual errors, indicating that factual hallucination is a severe problem in this field.
We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z)
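FactVC's internals are not given in this summary, so the sketch below illustrates the general idea of a model-based factuality check with a plain off-the-shelf NLI model instead: score whether a reference description entails the generated caption. This is a stand-in for illustration, not FactVC itself.

```python
# Generic model-based factuality check (a stand-in, NOT FactVC):
# use an NLI model to score whether the reference entails the caption.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli", top_k=None)

def entailment_score(reference: str, caption: str) -> float:
    """Probability that the reference description entails the caption."""
    scores = nli({"text": reference, "text_pair": caption})
    return next(s["score"] for s in scores if s["label"] == "ENTAILMENT")

ref = "A woman slices a tomato on a cutting board in a kitchen."
print(entailment_score(ref, "A woman is cutting a tomato."))  # high
print(entailment_score(ref, "A man is slicing an onion."))    # low
```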
- Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
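In the spirit of VidIL, the sketch below captions individual frames with BLIP and then prompts an LLM with one in-context example; the paper's actual models, prompt templates, and temporal tokens are richer than this, and the file names and example are hypothetical.

```python
# Sketch of the VidIL-style recipe (assumptions: BLIP as the image-language
# model, an OpenAI-style chat API as the language model).
from PIL import Image
from openai import OpenAI
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

def frame_caption(path: str) -> str:
    """Caption a single extracted video frame."""
    inputs = processor(images=Image.open(path).convert("RGB"),
                       return_tensors="pt")
    return processor.decode(blip.generate(**inputs)[0], skip_special_tokens=True)

# 1) Translate video content into text, frame by frame.
captions = [frame_caption(f"frame_{i}.jpg") for i in range(4)]  # hypothetical frames

# 2) Compose a few-shot prompt: an in-context example, then the new video.
prompt = (
    "Frames: a dog runs; a dog catches a frisbee. Event: playing fetch.\n"
    "Frames: " + "; ".join(captions) + ". Event:"
)
client = OpenAI()
print(client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": prompt}]
).choices[0].message.content)
```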
- Zero-shot Natural Language Video Localization [11.522385805128001]
We make a first attempt to train a natural language video localization model in a zero-shot manner.
Inspired by the unsupervised image captioning setup, we merely require random text corpora, unlabeled video collections, and an off-the-shelf object detector to train a model.
arXiv Detail & Related papers (2021-08-29T13:21:50Z)
- TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification [26.12591949900602]
We formulate a text-based task conditioner to adapt video features to the few-shot learning task.
Our model obtains state-of-the-art performance on four challenging benchmarks in few-shot video action classification.
arXiv Detail & Related papers (2021-06-21T15:08:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.