Constructing a Highlight Classifier with an Attention Based LSTM Neural
Network
- URL: http://arxiv.org/abs/2002.04608v1
- Date: Wed, 12 Feb 2020 15:18:31 GMT
- Title: Constructing a Highlight Classifier with an Attention Based LSTM Neural
Network
- Authors: Michael Kuehne and Marius Radu
- Abstract summary: Market researchers manually review the vast majority of consumer research video in order to identify relevant portions - highlights.
In this study we present a novel approach for NLP-based highlight identification and extraction based on a supervised learning model.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data is being produced in larger quantities than ever before in human
history. It's only natural to expect a rise in demand for technology that aids
humans in sifting through and analyzing this inexhaustible supply of
information. This need exists in the market research industry, where large
amounts of consumer research data are collected through video recordings. At
present, the standard method for analyzing video data is human labor. Market
researchers manually review the vast majority of consumer research video in
order to identify relevant portions, or highlights. The industry state-of-the-art
turnaround ratio is 2.2: for every hour of video content, 2.2 hours of manpower
are required. In this study, we present a novel approach for NLP-based highlight
identification and extraction based on a supervised learning model that aids
market researchers in sifting through their data. Our approach hinges on
manually curated, user-generated highlight clips constructed from long- and
short-form video data. The problem is well suited to an NLP approach due to
the availability of video transcription. We evaluate multiple classes of
models, from gradient boosting to recurrent neural networks, comparing their
performance in extraction and identification of highlights. The best performing
models are then evaluated using four sampling methods designed to analyze
documents much larger than the maximum input length of the classifiers. We
report strong performance for the standalone classifiers, with ROC AUC scores in
the range 0.93-0.94, but observe a significant drop in effectiveness when
evaluated on large documents. Based on our results we suggest combinations of
models/sampling algorithms for various use cases.
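To make the pipeline concrete, the sketch below (not the authors' code) shows what a minimal attention-based LSTM highlight classifier and a long-document scorer might look like in PyTorch. The class AttentionLSTMClassifier, the learned attention scoring, the layer sizes, and the sliding-window function score_long_transcript are all illustrative assumptions; the paper itself compares four sampling methods that are not detailed in this abstract.
```python
# Illustrative sketch only: an attention-based LSTM highlight classifier
# plus a simple sliding-window scorer for transcripts longer than the
# model's maximum input length. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class AttentionLSTMClassifier(nn.Module):
    """Binary highlight / non-highlight classifier over token ids."""

    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 256, max_len: int = 200):
        super().__init__()
        self.max_len = max_len
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Learned attention scores over the LSTM outputs.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) int64 ids, 0 = padding
        emb = self.embedding(token_ids)                          # (B, T, E)
        outputs, _ = self.lstm(emb)                              # (B, T, 2H)
        scores = self.attn(outputs).squeeze(-1)                  # (B, T)
        scores = scores.masked_fill(token_ids == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                  # (B, T)
        context = (weights.unsqueeze(-1) * outputs).sum(dim=1)   # (B, 2H)
        return torch.sigmoid(self.classifier(context)).squeeze(-1)


def score_long_transcript(model: AttentionLSTMClassifier,
                          token_ids: list[int],
                          stride: int = 100) -> list[float]:
    """Slide a fixed-size window over a transcript longer than the model's
    maximum input and return one highlight probability per window. This is
    only one plausible sampling scheme; the paper evaluates four."""
    model.eval()
    window = model.max_len
    probs = []
    with torch.no_grad():
        for start in range(0, max(len(token_ids) - window + 1, 1), stride):
            chunk = token_ids[start:start + window]
            chunk = chunk + [0] * (window - len(chunk))  # pad short windows
            batch = torch.tensor([chunk], dtype=torch.long)
            probs.append(model(batch).item())
    return probs


if __name__ == "__main__":
    model = AttentionLSTMClassifier(vocab_size=10_000)
    fake_transcript = torch.randint(1, 10_000, (1_000,)).tolist()
    window_probs = score_long_transcript(model, fake_transcript)
    print(f"{len(window_probs)} windows scored; "
          f"max highlight probability {max(window_probs):.3f}")
```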
Related papers
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos, each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- CapST: An Enhanced and Lightweight Model Attribution Approach for Synthetic Videos [9.209808258321559]
This paper investigates the model attribution problem for Deepfake videos using a recently proposed dataset, Deepfakes from Different Models (DFDM).
The dataset comprises 6,450 Deepfake videos generated by five distinct models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio.
Experimental results on the deepfake benchmark dataset (DFDM) demonstrate the efficacy of our proposed method, achieving up to a 4% improvement in accurately categorizing deepfake videos.
arXiv Detail & Related papers (2023-11-07T08:05:09Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models using simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- SurgMAE: Masked Autoencoders for Long Surgical Video Analysis [4.866110274299399]
Masked autoencoders (MAE) have gained attention in the self-supervised learning paradigm for Vision Transformers (ViTs).
In this paper, we first investigate whether MAE can learn transferrable representations in surgical video domain.
We propose SurgMAE, a novel architecture with a masking strategy for MAE based on sampling high temporal-information tokens.
arXiv Detail & Related papers (2023-05-19T06:12:50Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our automatically constructed data achieve downstream performance similar to that of models trained on existing video datasets of similar scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.