SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding
- URL: http://arxiv.org/abs/2404.01174v2
- Date: Thu, 23 May 2024 10:53:12 GMT
- Title: SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding
- Authors: Wenrui Li, Xiaopeng Hong, Ruiqin Xiong, Xiaopeng Fan
- Abstract summary: We introduce SpikeMba: a multi-modal spiking saliency mamba for temporal video grounding.
Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their unique advantages.
Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods.
- Score: 50.337896542603524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal video grounding (TVG) is a critical task in video content understanding, requiring precise alignment between video content and natural language instructions. Despite significant advancements, existing methods face challenges in managing confidence bias towards salient objects and capturing long-term dependencies in video sequences. To address these issues, we introduce SpikeMba: a multi-modal spiking saliency mamba for temporal video grounding. Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their unique advantages in handling different aspects of the task. Specifically, we use SNNs to develop a spiking saliency detector that generates the proposal set. The detector emits spike signals when the input signal exceeds a predefined threshold, resulting in a dynamic and binary saliency proposal set. To enhance the model's capability to retain and infer contextual information, we introduce relevant slots, learnable tensors that encode prior knowledge. These slots work with the contextual moment reasoner to maintain a balance between preserving contextual information and exploring semantic relevance dynamically. The SSMs facilitate selective information propagation, addressing the challenge of long-term dependency in video content. By combining SNNs for proposal generation and SSMs for effective contextual reasoning, SpikeMba addresses confidence bias and long-term dependencies, thereby significantly enhancing fine-grained multimodal relationship capture. Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods across mainstream benchmarks.
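The threshold-based spike mechanism described in the abstract can be illustrated with a minimal, hypothetical sketch: a per-clip saliency score is binarized into a spike whenever it exceeds a predefined threshold, yielding a binary proposal mask. The class and variable names below (SpikingSaliencyDetector, score_proj, clip_features) and the straight-through estimator used to keep the hard mask trainable are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SpikingSaliencyDetector(nn.Module):
    """Toy stand-in for a spiking saliency detector: emit a spike (1) whenever
    the projected saliency score of a clip exceeds a fixed threshold."""

    def __init__(self, feature_dim: int, threshold: float = 0.5):
        super().__init__()
        self.score_proj = nn.Linear(feature_dim, 1)  # per-clip saliency score
        self.threshold = threshold

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_clips, feature_dim) fused video/text features
        scores = torch.sigmoid(self.score_proj(clip_features)).squeeze(-1)
        spikes = (scores > self.threshold).float()  # binary proposal mask
        # Straight-through estimator (an assumption, not from the paper) so the
        # hard 0/1 mask still passes gradients to score_proj during training.
        return spikes.detach() + scores - scores.detach()

detector = SpikingSaliencyDetector(feature_dim=256)
feats = torch.randn(2, 64, 256)      # 2 videos, 64 clips, 256-d features
proposal_mask = detector(feats)      # values in {0, 1}, shape (2, 64)
```

In the full model the abstract pairs this proposal set with relevant slots, a contextual moment reasoner, and SSM-based selective propagation; those components are omitted from the sketch.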
Related papers
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner.
We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge.
In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z) - The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z) - MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling [19.004339956475498]
MAVIN is designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence.
We introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics.
Experimental results on horse and tiger scenarios demonstrate MAVIN's superior performance in generating smooth and coherent video transitions.
arXiv Detail & Related papers (2024-05-28T09:46:09Z) - Grounding-Prompter: Prompting LLM with Multimodal Information for
Temporal Sentence Grounding in Long Videos [42.32528440002539]
Temporal Sentence Grounding (TSG) aims to localize moments from videos based on the given natural language queries.
Existing works are mainly designed for short videos, failing to handle TSG in long videos.
We propose a Grounding-Prompter method, which is capable of conducting TSG in long videos through prompting LLM with multimodal information.
arXiv Detail & Related papers (2023-12-28T16:54:21Z) - Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task: Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z) - You Need to Read Again: Multi-granularity Perception Network for Moment
Retrieval in Videos [19.711703590063976]
We propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level.
Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework.
arXiv Detail & Related papers (2022-05-25T16:15:46Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Event and Activity Recognition in Video Surveillance for Cyber-Physical
Systems [0.0]
We show that long-term motion patterns alone play a pivotal role in the task of recognizing an event.
Only the temporal features are exploited using a hybrid Convolutional Neural Network (CNN) + Recurrent Neural Network (RNN) architecture.
arXiv Detail & Related papers (2021-11-03T08:30:38Z) - Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in the Temporal Inconsistency Module (TIM) by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions.
The Information Supplement Module (ISM) simultaneously utilizes the spatial information from the Spatial Inconsistency Module (SIM) and the temporal information from TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z) - Long Short-Term Relation Networks for Video Action Detection [155.13392337831166]
Long Short-Term Relation Networks (LSTR) are presented in this paper.
LSTR aggregates and propagates relations to augment features for video action detection.
Extensive experiments are conducted on four benchmark datasets.
arXiv Detail & Related papers (2020-03-31T10:02:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.