Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability
- URL: http://arxiv.org/abs/2009.02568v1
- Date: Sat, 5 Sep 2020 17:24:02 GMT
- Title: Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability
- Authors: Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry McNamara, and Aude Oliva
- Abstract summary: We develop a predictive model of human visual event memory and how those memories decay over time.
We introduce Memento10k, a new, dynamic video memorability dataset containing human annotations at different viewing delays.
- Score: 17.00485879591431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key capability of an intelligent system is deciding when events from past
experience must be remembered and when they can be forgotten. Towards this
goal, we develop a predictive model of human visual event memory and how those
memories decay over time. We introduce Memento10k, a new, dynamic video
memorability dataset containing human annotations at different viewing delays.
Based on our findings we propose a new mathematical formulation of memorability
decay, resulting in a model that is able to produce the first quantitative
estimation of how a video decays in memory over time. In contrast with previous
work, our model can predict the probability that a video will be remembered at
an arbitrary delay. Importantly, our approach combines visual and semantic
information (in the form of textual captions) to fully represent the meaning of
events. Our experiments on two video memorability benchmarks, including
Memento10k, show that our model significantly improves upon the best prior
approach (by 12% on average).
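The abstract does not give the paper's exact decay formulation, but its key claim (predicting the probability a video is remembered at an arbitrary delay) can be illustrated with a simple log-linear decay sketch. Everything below is an assumption for illustration: the function name, the reference delay `t0`, and the constants are not taken from the paper.

```python
import math

# Illustrative sketch (names and constants are assumptions, not from the paper):
# a log-linear decay model in which a video's memorability -- the probability
# that a viewer recognizes it on a repeat viewing -- falls off with the log of
# the viewing delay. A model in this spirit outputs two numbers per video: a
# base memorability m0 (at a reference delay t0) and a decay rate alpha.

def memorability_at_delay(m0: float, alpha: float, delay: float,
                          t0: float = 80.0) -> float:
    """Predicted probability that the video is remembered after `delay` seconds.

    m0    -- memorability at the reference delay t0
    alpha -- decay rate (negative: memory fades as the delay grows)
    """
    m = m0 + alpha * math.log(delay / t0)
    return min(1.0, max(0.0, m))  # clamp to a valid probability

# At the reference delay, the prediction is just the base memorability.
print(memorability_at_delay(0.95, -0.05, delay=80.0))    # -> 0.95
# At a longer delay, the predicted probability is lower.
print(memorability_at_delay(0.95, -0.05, delay=3600.0))
```

Under this assumption, a single forward pass yielding `(m0, alpha)` suffices to evaluate memorability at any delay, which is what distinguishes a decay model from a fixed-delay memorability score.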
Related papers
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
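The difference-in-differences idea behind this estimator can be shown with a toy numeric example. All numbers and names below are made up for illustration; they are not from the paper.

```python
# Hypothetical per-instance losses (lower = better predicted), invented for
# illustration. "Treated" instances enter training at some step; "control"
# instances never do. Memorisation is estimated as the extra improvement the
# treated group shows after that step, beyond the trend in the control group.

treated_before, treated_after = 3.2, 1.1   # loss drops sharply once trained on
control_before, control_after = 3.1, 2.8   # loss drifts down slightly anyway

# Difference-in-differences: (treated change) - (control change)
memorisation_effect = (treated_after - treated_before) - (control_after - control_before)
print(round(memorisation_effect, 2))  # -1.8
```

The control group's drift subtracts out general training progress, so the remaining `-1.8` is attributed to having seen the treated instances, which is the causal reading of memorisation described above.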
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Glance and Focus: Memory Prompting for Multi-Event Video Question Answering [36.00733800536469]
VideoQA has emerged as a vital tool to evaluate agents' ability to understand human daily behaviors.
Humans can easily tackle it by using a series of episode memories as anchors to quickly locate question-related key moments for reasoning.
We propose the Glance-Focus model to mimic this effective reasoning strategy.
arXiv Detail & Related papers (2024-01-03T03:51:16Z)
- STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction [20.701792842768747]
We propose a novel video prediction model, which has infinite-dimensional latent variables over the temporal domain.
Our model achieves temporally continuous prediction, i.e., it can predict at an arbitrarily high frame rate in an unsupervised way.
arXiv Detail & Related papers (2023-12-11T16:12:43Z)
- Eye vs. AI: Human Gaze and Model Attention in Video Memorability [22.718191366938278]
We propose a Transformer-based model with naturalistic-temporal attention that matches SoTA performance on video memorability prediction.
We compare model attention against human gaze fixation density maps collected through a small-scale eye-tracking experiment.
We observe that the model assigns greater importance to the initial frames, mimicking temporal attention patterns found in humans.
arXiv Detail & Related papers (2023-11-26T05:14:06Z)
- Memory-and-Anticipation Transformer for Online Action Understanding [52.24561192781971]
We propose a novel memory-anticipation-based paradigm to model an entire temporal structure, including the past, present, and future.
We present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.
arXiv Detail & Related papers (2023-08-15T17:34:54Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current time stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Associative Memories via Predictive Coding [37.59398215921529]
Associative memories in the brain receive and store patterns of activity registered by the sensory neurons.
We present a novel neural model for realizing associative memories based on a hierarchical generative network that receives external stimuli via sensory neurons.
arXiv Detail & Related papers (2021-09-16T15:46:26Z)
- FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information it contains and is not responsible for any consequences of its use.