Accurate and Fast Compressed Video Captioning
- URL: http://arxiv.org/abs/2309.12867v2
- Date: Wed, 3 Jan 2024 08:29:03 GMT
- Title: Accurate and Fast Compressed Video Captioning
- Authors: Yaojie Shen, Xin Gu, Kai Xu, Heng Fan, Longyin Wen, Libo Zhang
- Abstract summary: Existing video captioning approaches typically first sample frames from a decoded video and then conduct subsequent processing.
We study video captioning from a different perspective, in the compressed domain, which brings multi-fold advantages over the existing pipeline.
We propose a simple yet effective end-to-end transformer that learns directly from the compressed video for captioning.
- Score: 28.19362369787383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing video captioning approaches typically require first sampling video
frames from a decoded video and then conducting subsequent processing (e.g.,
feature extraction and/or captioning model learning). In this pipeline, manual
frame sampling may ignore key information in videos and thus degrade
performance. Additionally, redundant information in the sampled frames may
lead to low inference efficiency for video captioning. Addressing this,
we study video captioning from a different perspective, in the compressed
domain, which brings multi-fold advantages over the existing pipeline: 1)
Compared to raw images from the decoded video, the compressed video, consisting
of I-frames, motion vectors and residuals, is highly distinguishable, which
allows us to leverage the entire video for learning without manual sampling
through a specialized model design; 2) the captioning model is more efficient
at inference because it processes a smaller and less redundant input. We
propose a simple yet effective end-to-end transformer that learns directly from
the compressed video for captioning. We show that even with a simple design,
our method can achieve state-of-the-art performance on different benchmarks
while running almost 2x faster than existing approaches. Code is available at
https://github.com/acherstyx/CoCap.
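The architecture described in the abstract can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch model (not the authors' CoCap code; see the linked repository for that) illustrating the general idea of encoding I-frame, motion-vector, and residual features with a shared transformer and decoding a caption from the fused representation. All module names, feature dimensions, and token counts are illustrative assumptions, and a real pipeline would obtain these inputs by parsing the codec bitstream rather than feeding random tensors.

```python
# Minimal sketch of a compressed-domain captioning model: I-frame, motion-vector,
# and residual features are projected by separate linear layers, fused by a shared
# transformer encoder, and decoded into caption tokens. Illustrative only; not the
# CoCap implementation, and all dimensions are assumptions.
import torch
import torch.nn as nn


class CompressedVideoCaptioner(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000, n_layers=4, n_heads=8):
        super().__init__()
        # Per-modality projections of pre-extracted compressed-stream features.
        self.iframe_proj = nn.Linear(2048, d_model)    # I-frame appearance features
        self.motion_proj = nn.Linear(256, d_model)     # motion-vector features
        self.residual_proj = nn.Linear(256, d_model)   # residual features
        # Learned embeddings marking which modality each token comes from.
        self.type_embed = nn.Embedding(3, d_model)

        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)

        # Autoregressive caption decoder over the fused video representation.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, iframes, motions, residuals, caption_tokens):
        # iframes: (B, Ni, 2048); motions, residuals: (B, Nm, 256); captions: (B, T)
        tokens = torch.cat([
            self.iframe_proj(iframes) + self.type_embed.weight[0],
            self.motion_proj(motions) + self.type_embed.weight[1],
            self.residual_proj(residuals) + self.type_embed.weight[2],
        ], dim=1)
        memory = self.encoder(tokens)

        tgt = self.token_embed(caption_tokens)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)  # (B, T, vocab_size) logits


if __name__ == "__main__":
    model = CompressedVideoCaptioner()
    logits = model(
        torch.randn(2, 8, 2048),            # 8 I-frames per clip
        torch.randn(2, 56, 256),            # motion-vector tokens from P-frames
        torch.randn(2, 56, 256),            # residual tokens from P-frames
        torch.randint(0, 10000, (2, 20)),   # caption token ids
    )
    print(logits.shape)  # torch.Size([2, 20, 10000])
```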
Related papers
- SITAR: Semi-supervised Image Transformer for Action Recognition [20.609596080624662]
This paper addresses video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos.
We capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images.
Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition.
arXiv Detail & Related papers (2024-09-04T17:49:54Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Streaming Dense Video Captioning [85.70265343236687]
An ideal model for dense video captioning should be able to handle long input videos and predict rich, detailed textual descriptions.
Current state-of-the-art models process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video.
We propose a streaming dense video captioning model that consists of two novel components.
arXiv Detail & Related papers (2024-04-01T17:59:15Z)
- SnapCap: Efficient Snapshot Compressive Video Captioning [18.016261978231835]
Video Captioning (VC) is a challenging multi-modal task since it requires describing the scene in language while understanding diverse and complex videos.
In this paper, we propose a novel VC pipeline to generate captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera.
To better extract language-related visual representations from the compressed measurement, we propose to distill the knowledge from videos via a pre-trained CLIP.
arXiv Detail & Related papers (2024-01-10T03:11:21Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework that enables research on hour-long videos with the same hardware that previously could only process second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z)
- An Integrated Approach for Video Captioning and Applications [2.064612766965483]
We design hybrid deep learning architectures for captioning long videos.
We argue that linking images, videos, and natural language offers many practical benefits and immediate practical applications.
arXiv Detail & Related papers (2022-01-23T01:06:00Z)
- SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
- Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate descriptions of the video content.
arXiv Detail & Related papers (2021-03-09T08:17:17Z)