Live Video Captioning
- URL: http://arxiv.org/abs/2406.14206v2
- Date: Mon, 26 May 2025 12:27:03 GMT
- Title: Live Video Captioning
- Authors: Eduardo Blanco-Fernández, Carlos Gutiérrez-Álvarez, Nadia Nasri, Saturnino Maldonado-Bascón, Roberto J. López-Sastre
- Abstract summary: We introduce a groundbreaking paradigm: Live Video Captioning (LVC), where captions must be generated for video streams in an online manner. We formally define the novel problem of LVC and propose innovative evaluation metrics specifically designed for this online scenario. We present a new model that combines deformable transformers with temporal filtering, enabling effective captioning over video streams.
- Score: 0.6291443816903801
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense video captioning involves detecting and describing events within video sequences. Traditional methods operate in an offline setting, assuming the entire video is available for analysis. In contrast, in this work we introduce a groundbreaking paradigm: Live Video Captioning (LVC), where captions must be generated for video streams in an online manner. This shift brings unique challenges, including processing partial observations of the events and the need for a temporal anticipation of the actions. We formally define the novel problem of LVC and propose innovative evaluation metrics specifically designed for this online scenario, demonstrating their advantages over traditional metrics. To address the novel complexities of LVC, we present a new model that combines deformable transformers with temporal filtering, enabling effective captioning over video streams. Extensive experiments on the ActivityNet Captions dataset validate the proposed approach, showcasing its superior performance in the LVC setting compared to state-of-the-art offline methods. To foster further research, we provide the results of our model and an evaluation toolkit with the new metrics integrated at: https://github.com/gramuah/lvc.
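To make the online setting above concrete, here is a minimal, hypothetical sketch of an LVC-style inference loop: the captioner only ever sees the frames streamed so far and periodically emits a caption from that partial observation. The names (`StreamCaptioner`, `caption_window`, the window and stride sizes) are illustrative assumptions, not the API of the paper's model or of the toolkit at https://github.com/gramuah/lvc.

```python
from collections import deque


class StreamCaptioner:
    """Toy stand-in for an online captioning model (illustrative only)."""

    def __init__(self, window_size=64):
        # Keep only the most recent frames: the partial observation of an event.
        self.buffer = deque(maxlen=window_size)

    def observe(self, frame):
        self.buffer.append(frame)

    def caption_window(self):
        # A real model (the paper combines deformable transformers with
        # temporal filtering) would decode a sentence here; we return a stub.
        return f"caption from the last {len(self.buffer)} frames"


def live_captioning(frame_stream, captioner, stride=16):
    """Emit (timestamp, caption) pairs online, using only frames seen so far."""
    outputs = []
    for t, frame in enumerate(frame_stream):
        captioner.observe(frame)
        if (t + 1) % stride == 0:  # periodic prediction over the stream
            outputs.append((t, captioner.caption_window()))
    return outputs


if __name__ == "__main__":
    fake_stream = range(200)  # stand-in for decoded video frames
    for t, caption in live_captioning(fake_stream, StreamCaptioner())[:3]:
        print(t, caption)
```

Unlike offline dense video captioning, no future frames are consulted at prediction time; this constraint is what the proposed online evaluation metrics are designed around.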
Related papers
- DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment [17.85550556489256]
This paper proposes Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA).
A Video-Based Temporal CLIP module is proposed to explicitly model temporal dynamics and enhance motion perception, aligning with the dorsal stream.
A Temporal Context Module is developed to refine inter-frame dependencies, further improving motion modeling.
Finally, a text-guided adaptive fusion strategy is proposed to enable more effective integration of spatial and temporal information.
arXiv Detail & Related papers (2025-04-16T03:20:28Z)
- Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
- CLIPVQA:Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos, each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z)
- READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling [31.745255364708864]
We introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time.
We propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability.
We validate our READ framework through extensive experiments where READ significantly outperforms all existing fine-tuning strategies.
arXiv Detail & Related papers (2023-12-12T03:09:30Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Online Anomaly Detection over Live Social Video Streaming [17.73632683825434]
Social video anomaly detection plays a critical role in applications from e-commerce to e-learning.
Traditionally, anomaly detection techniques are applied to find anomalies in video broadcasting.
We propose a generic framework for online detection of Anomalies Over social Video LIve Streaming.
arXiv Detail & Related papers (2023-12-01T23:30:45Z)
- Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task: Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z)
- In Defense of Clip-based Video Relation Detection [32.05021939177942]
Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries.
We propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips.
Our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
arXiv Detail & Related papers (2023-07-18T05:42:01Z)
- FODVid: Flow-guided Object Discovery in Videos [12.792602427704395]
We focus on building a generalizable solution that avoids overfitting to the individual intricacies.
To solve Video Object Segmentation (VOS) in an unsupervised setting, we propose a new pipeline (FODVid) based on the idea of guiding segmentation outputs.
arXiv Detail & Related papers (2023-07-10T07:55:42Z)
- Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
Correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive information changing across video sequence and captions.
Our method significantly outperforms competitive baselines, and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z)
- End-to-End Dense Video Captioning with Parallel Decoding [53.34238344647624]
We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces under the holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results.
arXiv Detail & Related papers (2021-08-17T17:39:15Z)
- Delving Deeper into the Decoder for Video Captioning [23.202746094988715]
Video captioning is an advanced multi-modal task which aims to describe a video clip using a natural language sentence.
We make a thorough investigation into the decoder and adopt three techniques to improve the performance of the model.
The improvements are demonstrated in experiments on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets.
arXiv Detail & Related papers (2020-01-16T02:18:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.