Dense Video Captioning Using Unsupervised Semantic Information
- URL: http://arxiv.org/abs/2112.08455v1
- Date: Wed, 15 Dec 2021 20:03:42 GMT
- Title: Dense Video Captioning Using Unsupervised Semantic Information
- Authors: Valter Estevam and Rayson Laroca and Helio Pedrini and David Menotti
- Abstract summary: We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events.
We split a long video into short frame sequences to extract their latent representation with three-dimensional convolutional neural networks.
We demonstrate how this representation can improve the performance of the dense video captioning task in a scenario with only visual features.
- Score: 2.022555840231001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a method to learn unsupervised semantic visual information based
on the premise that complex events (e.g., minutes) can be decomposed into
simpler events (e.g., a few seconds), and that these simple events are shared
across several complex events. We split a long video into short frame sequences
to extract their latent representation with three-dimensional convolutional
neural networks. A clustering method is used to group representations producing
a visual codebook (i.e., a long video is represented by a sequence of integers
given by the cluster labels). A dense representation is learned by encoding the
co-occurrence probability matrix for the codebook entries. We demonstrate how
this representation can improve the performance of the dense video captioning
task in a scenario with only visual features. As a result of this approach, we
are able to replace the audio signal in the Bi-Modal Transformer (BMT) method
and produce temporal proposals with comparable performance. Furthermore, we
concatenate the visual signal with our descriptor in a vanilla transformer
method to achieve state-of-the-art performance in captioning compared to the
methods that explore only visual features, as well as a competitive performance
with multi-modal methods. Our code is available at
https://github.com/valterlej/dvcusi.
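The codebook and co-occurrence steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `assign_codes` and `cooccurrence_probabilities` are hypothetical, the cluster centroids are assumed to be already fitted, the clip features are toy 2-D stand-ins for 3D-CNN outputs, and the fixed-size context window used for counting co-occurrences is an assumption that the abstract does not specify. The final step of learning a dense representation from the matrix is not shown.

```python
from collections import Counter
from math import dist


def assign_codes(clip_features, centroids):
    """Map each clip feature vector to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: dist(feat, centroids[k]))
        for feat in clip_features
    ]


def cooccurrence_probabilities(code_sequences, vocab_size, window=2):
    """Count code pairs that occur within `window` positions of each other,
    then normalise each non-empty row so that it sums to 1."""
    counts = Counter()
    for seq in code_sequences:
        for i, a in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(a, seq[j])] += 1
    matrix = [[0.0] * vocab_size for _ in range(vocab_size)]
    for (a, b), c in counts.items():
        matrix[a][b] = float(c)
    for row in matrix:
        total = sum(row)
        if total > 0:
            for k in range(vocab_size):
                row[k] /= total
    return matrix


# Toy data (assumption): three pre-fitted centroids and four clip features.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
video = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]

codes = assign_codes(video, centroids)  # the video as a sequence of integers
probs = cooccurrence_probabilities([codes], vocab_size=len(centroids), window=1)
print(codes)
print(probs)
```

In this sketch a long video becomes the integer sequence `codes`, and each row of `probs` gives the probability of the other codebook entries appearing next to a given entry.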
Related papers
- SITAR: Semi-supervised Image Transformer for Action Recognition [20.609596080624662]
This paper addresses video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos.
We capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images.
Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition.
arXiv Detail & Related papers (2024-09-04T17:49:54Z)
- Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z)
- Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z)
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart, on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
- End-to-End Dense Video Captioning with Parallel Decoding [53.34238344647624]
We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces under the holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results.
arXiv Detail & Related papers (2021-08-17T17:39:15Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentation, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.