SnapCap: Efficient Snapshot Compressive Video Captioning
- URL: http://arxiv.org/abs/2401.04903v1
- Date: Wed, 10 Jan 2024 03:11:21 GMT
- Title: SnapCap: Efficient Snapshot Compressive Video Captioning
- Authors: Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue
Wang, Bo Chen, Xin Yuan
- Abstract summary: Video Captioning (VC) is a challenging multi-modal task, since it requires describing a scene in natural language based on an understanding of diverse and complex videos.
In this paper, we propose a novel VC pipeline that generates captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera.
To better extract language-related visual representations from the compressed measurement, we propose to distill knowledge from videos via a pre-trained CLIP.
- Score: 18.016261978231835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Captioning (VC) is a challenging multi-modal task, since it
requires describing a scene in natural language based on an understanding of
diverse and complex videos. For machines, traditional VC follows an
"imaging-compression-decoding-and-then-captioning" pipeline, in which
compression is pivotal for storage and transmission. Such a pipeline, however,
has inherent shortcomings: information redundancy leads to low efficiency, and
information is lost during the sampling process, both of which hurt
captioning. To address these problems, in this paper we propose a novel VC
pipeline that generates captions directly from the compressed measurement,
which can be captured by a snapshot compressive sensing camera; we dub our
model SnapCap. More specifically, by simulating the sensing process, we can
obtain abundant measurement-video-annotation data pairs for training our
model. In addition, to better extract language-related visual representations
from the compressed measurement, we propose to distill knowledge from the
videos via a pre-trained CLIP, whose rich language-vision associations guide
the learning of SnapCap. To demonstrate the effectiveness of SnapCap, we
conduct experiments on two widely used VC datasets. Both qualitative and
quantitative results verify the superiority of our pipeline over conventional
VC pipelines. In particular, compared to "caption-after-reconstruction"
methods, SnapCap runs at least 3$\times$ faster and achieves better
captioning results.
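Read as an architecture description, the abstract points to two concrete steps: simulating the snapshot compressive measurement to build measurement-video-annotation pairs, and distilling CLIP features into a measurement encoder. The sketch below illustrates both steps in PyTorch; the mask generation, encoder, and teacher features are illustrative placeholders, not the authors' released implementation.

```python
# Minimal sketch, assuming PyTorch: (1) simulate a snapshot measurement
# Y = sum_t M_t * X_t from a short frame stack, (2) distill a frozen CLIP
# teacher into a measurement encoder. Module names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

def simulate_measurement(frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """frames: (B, T, H, W) grayscale clip; masks: (T, H, W) binary coded apertures.
    Returns one 2D snapshot measurement per clip, shape (B, H, W)."""
    return (frames * masks.unsqueeze(0)).sum(dim=1)

class MeasurementEncoder(nn.Module):
    """Toy CNN mapping a measurement to a CLIP-sized (512-d) embedding."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.backbone(y.unsqueeze(1))  # add a channel dimension

def distill_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """Cosine distillation: align measurement features with (frozen) CLIP
    features pooled over the clip's frames."""
    return 1 - F.cosine_similarity(student_feat, teacher_feat, dim=-1).mean()

# Usage with random tensors standing in for data and the CLIP teacher.
B, T, H, W = 2, 8, 128, 128
frames = torch.rand(B, T, H, W)
masks = (torch.rand(T, H, W) > 0.5).float()
y = simulate_measurement(frames, masks)     # (B, H, W)
student_feat = MeasurementEncoder()(y)      # (B, 512)
teacher_feat = torch.randn(B, 512)          # placeholder for pooled CLIP features
loss = distill_loss(student_feat, teacher_feat)
```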
Related papers
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [73.62572976072578]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement a token merging strategy, reducing the number of input visual tokens.
AuroraCap shows superior performance on various video and image captioning benchmarks.
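The abstract does not spell out the merging rule, but a common way to reduce visual tokens is bipartite soft matching in the style of ToMe; the sketch below shows that generic technique, which is not necessarily AuroraCap's exact strategy.

```python
# Hedged sketch of token merging: split tokens into two alternating sets, find
# the most redundant pairs by cosine similarity, and average them together.
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: (N, D) visual tokens. Merges r tokens into their nearest partners,
    returning (N - r, D)."""
    a, b = x[::2], x[1::2]                       # two alternating token sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()
    best_sim, best_idx = sim.max(dim=-1)         # best partner in B for each A token
    order = best_sim.argsort(descending=True)
    src, keep = order[:r], order[r:]             # r most redundant A tokens get merged
    b = b.clone()
    b[best_idx[src]] = (b[best_idx[src]] + a[src]) / 2
    return torch.cat([a[keep], b], dim=0)

tokens = torch.randn(256, 768)                   # e.g. ViT patch tokens
reduced = merge_tokens(tokens, r=64)             # (192, 768)
```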
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z)
- Retrieval Enhanced Zero-Shot Video Captioning [69.96136689829778]
We bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2.
To bridge these frozen models, we propose using learnable tokens as a communication medium between the frozen GPT-2 and the frozen XCLIP.
Experiments show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
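A minimal sketch of the learnable-token bridge, assuming PyTorch and illustrative dimensions: a small set of trainable queries cross-attends to frozen video features and produces prefix embeddings for the frozen language model. The module and sizes below are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn

class BridgeTokens(nn.Module):
    """Trainable tokens that translate frozen video features into LM-space prefixes."""
    def __init__(self, num_tokens: int = 8, vid_dim: int = 512, lm_dim: int = 768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, lm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(lm_dim, num_heads=8, kdim=vid_dim,
                                          vdim=vid_dim, batch_first=True)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        """video_feats: (B, T, vid_dim) frozen per-frame features (e.g. from XCLIP).
        Returns (B, num_tokens, lm_dim) prefix embeddings for the frozen LM."""
        q = self.queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        out, _ = self.attn(q, video_feats, video_feats)
        return out

bridge = BridgeTokens()
prefix = bridge(torch.randn(4, 16, 512))   # (4, 8, 768)
# `prefix` would be concatenated with caption token embeddings and fed to the
# frozen GPT-2; only the bridge parameters receive gradients.
```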
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Accurate and Fast Compressed Video Captioning [28.19362369787383]
Existing video captioning approaches typically require first sampling frames from a decoded video and then running the captioning model on them.
We study video captioning from a different perspective, in the compressed domain, which brings multi-fold advantages over the existing pipeline.
We propose a simple yet effective end-to-end transformer in the compressed domain that learns to caption directly from the compressed video.
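As a rough illustration of what learning from the compressed video can look like, the sketch below feeds the signals already present in a bitstream (I-frame features, motion vectors, residual features) into a transformer encoder. The feature extractors and dimensions are placeholders and do not reproduce the paper's architecture.

```python
import torch
import torch.nn as nn

class CompressedDomainEncoder(nn.Module):
    """Encodes compressed-domain streams instead of fully decoded frames."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.iframe_proj = nn.Linear(2048, d_model)   # e.g. CNN features of I-frames
        self.mv_proj = nn.Linear(2, d_model)          # pooled (dx, dy) motion vectors
        self.res_proj = nn.Linear(1536, d_model)      # residual-block features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, iframe_feats, mv_feats, res_feats):
        tokens = torch.cat([self.iframe_proj(iframe_feats),
                            self.mv_proj(mv_feats),
                            self.res_proj(res_feats)], dim=1)
        return self.encoder(tokens)                   # consumed by a caption decoder

enc = CompressedDomainEncoder()
out = enc(torch.randn(2, 4, 2048), torch.randn(2, 32, 2), torch.randn(2, 32, 1536))
```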
arXiv Detail & Related papers (2023-09-22T13:43:22Z)
- Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning [37.44886367452029]
3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions.
Existing methods adopt a sophisticated "detect-then-describe" pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components.
We first propose Vote2Cap-DETR, a simple-yet-effective transformer framework that decouples the decoding process of caption generation and object localization through parallel decoding.
arXiv Detail & Related papers (2023-09-06T13:43:27Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
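One simple way to realise the projection into the CLIP text embedding space is a softmax-weighted combination over a support memory of text embeddings, as sketched below; the memory, temperature, and dimensions are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def project_to_text_space(img_emb: torch.Tensor,
                          text_memory: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """img_emb: (D,) CLIP image embedding; text_memory: (N, D) CLIP text embeddings
    of a support corpus. Returns a (D,) embedding lying in the text space."""
    img_emb = F.normalize(img_emb, dim=-1)
    mem = F.normalize(text_memory, dim=-1)
    weights = F.softmax(mem @ img_emb / temperature, dim=0)  # (N,) attention over memory
    return weights @ mem                                     # convex combination of text embeddings

projected = project_to_text_space(torch.randn(512), torch.randn(10000, 512))
# `projected` is then decoded by a lightweight decoder trained only on text embeddings.
```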
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- Retrieval-augmented Image Captioning [15.266569206458648]
We present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore.
The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT.
Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
arXiv Detail & Related papers (2023-02-16T12:54:13Z)
- Learning Audio-Video Modalities from Image Captions [62.772232865072745]
A major challenge in text-video and text-audio retrieval is the lack of large-scale training data.
We propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
We show that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips.
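A minimal sketch of the caption-transfer mining step, assuming precomputed visual embeddings from a frozen encoder: each captioned image is matched to its most similar video clip, and the caption is attached when the similarity clears a threshold. The threshold and embedding source below are illustrative.

```python
import torch
import torch.nn.functional as F

def transfer_captions(image_embs, captions, clip_embs, sim_threshold=0.8):
    """image_embs: (M, D) image embeddings; captions: list of M strings;
    clip_embs: (K, D) pooled clip-level video embeddings.
    Returns (clip_index, caption) pairs to use as weak video supervision."""
    image_embs = F.normalize(image_embs, dim=-1)
    clip_embs = F.normalize(clip_embs, dim=-1)
    sims = image_embs @ clip_embs.t()                  # (M, K) cosine similarities
    pairs = []
    for m in range(sims.size(0)):
        best_sim, best_k = sims[m].max(dim=0)
        if best_sim.item() >= sim_threshold:
            pairs.append((best_k.item(), captions[m]))
    return pairs

pairs = transfer_captions(torch.randn(3, 512),
                          ["a dog runs on grass", "a red car", "a cat sleeps"],
                          torch.randn(100, 512))
```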
arXiv Detail & Related papers (2022-04-01T19:48:18Z)
- End-to-End Dense Video Captioning with Parallel Decoding [53.34238344647624]
We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces under the holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results.
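The parallel-decoding idea can be sketched with a DETR-style decoder: a fixed set of learnable event queries attends to frame features and all event proposals are predicted at once. The sketch below keeps only a segment head and uses illustrative sizes; it is not the released PDVC model, which additionally predicts a caption per query.

```python
import torch
import torch.nn as nn

class ParallelEventDecoder(nn.Module):
    """Decodes a fixed set of event queries in parallel (DETR-style)."""
    def __init__(self, num_queries: int = 10, d_model: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.segment_head = nn.Linear(d_model, 2)      # normalized (center, length)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (B, T, d_model). Returns per-query segments, (B, Q, 2)."""
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        hs = self.decoder(q, frame_feats)              # all event queries decoded at once
        return self.segment_head(hs).sigmoid()

dec = ParallelEventDecoder()
segments = dec(torch.randn(2, 64, 256))                # (2, 10, 2)
```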
arXiv Detail & Related papers (2021-08-17T17:39:15Z)